Privilege Escalation Guardrail
OpenAIPrivilegeEscalationGuardrail
Bases: Guardrail
Guardrail to detect privilege escalation prompts using an OpenAI language model.
This class uses an OpenAI language model to predict whether a given prompt is attempting to perform a privilege escalation. It does so by sending the prompt to the model along with predefined system and user prompts, and then analyzing the model's response.
Attributes:
Name | Type | Description |
---|---|---|
llm_model |
OpenAIModel
|
The language model used to predict privilege escalation. |
Methods:
Name | Description |
---|---|
guard |
str, **kwargs) -> dict: Analyzes the given prompt to determine if it is a privilege escalation attempt. Returns a dictionary with the analysis result. |
predict |
str) -> dict: A wrapper around the guard method to provide a consistent interface. |
Source code in safeguards/guardrails/privilege_escalation/priv_esc_guardrails.py
guard(prompt, **kwargs)
Analyzes the given prompt to determine if it is a privilege escalation attempt.
This function uses an OpenAI language model to predict whether a given prompt is attempting to perform a privilege escalation. It sends the prompt to the model along with predefined system and user prompts, and then analyzes the model's response. The response is parsed to check if the prompt is a privilege escalation attempt and provides a summary of the reasoning.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt
|
str
|
The input prompt to be analyzed. |
required |
**kwargs
|
Additional keyword arguments that may be passed to the model's predict method. |
{}
|
Returns:
Name | Type | Description |
---|---|---|
dict |
A dictionary containing the safety status and a summary of the analysis. - "safe" (bool): Indicates whether the prompt is safe (True) or a privilege escalation attempt (False). - "summary" (str): A summary of the reasoning behind the classification. |
Source code in safeguards/guardrails/privilege_escalation/priv_esc_guardrails.py
predict(prompt)
A wrapper around the guard method to provide a consistent interface.
This function calls the guard method to analyze the given prompt and determine if it is a privilege escalation attempt. It returns the result of the guard method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt
|
str
|
The input prompt to be analyzed. |
required |
Returns:
Name | Type | Description |
---|---|---|
dict |
dict
|
A dictionary containing the safety status and a summary of the analysis. |
Source code in safeguards/guardrails/privilege_escalation/priv_esc_guardrails.py
SQLInjectionGuardrail
Bases: Guardrail
A guardrail class designed to detect SQL injection attacks in SQL queries generated by a language model (LLM) based on user prompts.
This class utilizes a pre-trained MobileBERT model for sequence classification to evaluate whether a given SQL query is potentially harmful due to SQL injection. It leverages the model's ability to classify text sequences to determine if the query is safe or indicative of an injection attack.
Attributes:
Name | Type | Description |
---|---|---|
model_name |
str
|
The name of the pre-trained MobileBERT model used for SQL injection detection. |
Methods:
Name | Description |
---|---|
model_post_init |
Any) -> None: Initializes the tokenizer and model for sequence classification, setting the model to evaluation mode and moving it to the appropriate device (CPU or GPU). |
validate_sql_injection |
str) -> int: Processes the input text using the tokenizer and model to predict the class of the SQL query. Returns the predicted class, where 0 indicates a safe query and 1 indicates a potential SQL injection. |
guard |
str) -> dict: Analyzes the given prompt to determine if it results in a SQL injection attack. Returns a dictionary with the safety status and a summary of the analysis. |
predict |
str) -> dict: A wrapper around the guard method to provide a consistent interface for evaluating prompts. |
Source code in safeguards/guardrails/privilege_escalation/priv_esc_guardrails.py
99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 |
|
guard(prompt)
Analyzes the given prompt to determine if it results in a SQL injection attack.
This function uses the validate_sql_injection
method to process the input prompt
and predict whether it is a safe query or a potential SQL injection attack. The
prediction is based on a pre-trained MobileBERT model for sequence classification.
The function returns a dictionary containing the safety status and a summary of
the analysis.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt
|
str
|
The input prompt to be analyzed. |
required |
Returns:
Name | Type | Description |
---|---|---|
dict |
dict
|
A dictionary with two keys: - "safe": A boolean indicating whether the prompt is safe (True) or a SQL injection attack (False). - "summary": A string summarizing the analysis result, indicating whether the prompt is a SQL injection attack. |
Source code in safeguards/guardrails/privilege_escalation/priv_esc_guardrails.py
predict(prompt)
A wrapper around the guard
method to provide a consistent interface for evaluating prompts.
This function calls the guard
method to analyze the given prompt and determine if it
results in a SQL injection attack. It returns the same dictionary as the guard
method,
containing the safety status and a summary of the analysis.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt
|
str
|
The input prompt to be evaluated. |
required |
Returns:
Name | Type | Description |
---|---|---|
dict |
dict
|
A dictionary with two keys: - "safe": A boolean indicating whether the prompt is safe (True) or a SQL injection attack (False). - "summary": A string summarizing the analysis result, indicating whether the prompt is a SQL injection attack. |