Prompt Injection Classifier Guardrail
PromptInjectionClassifierGuardrail
Bases: Guardrail
A guardrail that uses a pre-trained text-classification model to classify prompts for potential injection attacks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model_name
|
str
|
The name of the HuggingFace model or a WandB checkpoint artifact path to use for classification. |
required |
Source code in guardrails_genie/guardrails/injection/classifier_guardrail.py
guard(prompt)
Analyzes the given prompt to determine if it is safe or potentially an injection attack.
This function uses a pre-trained text-classification model to classify the prompt.
It calls the classify
method to get the classification result, which includes a label
and a confidence score. The function then calculates the confidence percentage and
returns a dictionary with two keys:
- "safe": A boolean indicating whether the prompt is safe (True) or an injection (False).
- "summary": A string summarizing the classification result, including the label and the confidence percentage.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt
|
str
|
The input prompt to be classified. |
required |
Returns:
Name | Type | Description |
---|---|---|
dict |
A dictionary containing the safety status and a summary of the classification result. |