LLM Guardrail
PromptInjectionLLMGuardrail
Bases: Guardrail
The PromptInjectionLLMGuardrail
uses a summarized version of the research paper
An Early Categorization of Prompt Injection Attacks on Large Language Models
to assess whether a prompt is a prompt injection attack or not.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
llm_model
|
OpenAIModel
|
The LLM model to use for the guardrail. |
required |
Source code in safeguards/guardrails/injection/llm_guardrail.py
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 |
|
format_prompts(prompt)
Formats the user and system prompts for assessing potential prompt injection attacks.
This function constructs two types of prompts: a user prompt and a system prompt.
The user prompt includes the content of a research paper on prompt injection attacks,
which is loaded using the load_prompt_injection_survey
method. This content is
wrapped in a specific format to serve as a reference for the assessment process.
The user prompt also includes the input prompt that needs to be evaluated for
potential injection attacks, enclosed within
The system prompt provides detailed instructions to an expert system on how to analyze the input prompt. It specifies that the system should use the research papers as a reference to determine if the input prompt is a prompt injection attack, and if so, classify it as a direct or indirect attack and identify the specific type. The system is instructed to provide a detailed explanation of its assessment, citing specific parts of the research papers, and to follow strict guidelines to ensure accuracy and clarity.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt
|
str
|
The input prompt to be assessed for potential injection attacks. |
required |
Returns:
Name | Type | Description |
---|---|---|
tuple |
str
|
A tuple containing the formatted user prompt and system prompt. |
Source code in safeguards/guardrails/injection/llm_guardrail.py
guard(prompt, **kwargs)
Assesses the given input prompt for potential prompt injection attacks and provides a summary.
This function uses the predict
method to determine whether the input prompt is a prompt injection attack.
It then constructs a summary based on the prediction, indicating whether the prompt is safe or an attack.
If the prompt is deemed an attack, the summary specifies whether it is a direct or indirect attack and the type of attack.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt
|
str
|
The input prompt to be assessed for potential injection attacks. |
required |
**kwargs
|
Additional keyword arguments to be passed to the |
{}
|
Returns:
Name | Type | Description |
---|---|---|
dict |
list[str]
|
A dictionary containing: - "safe" (bool): Indicates whether the prompt is safe (True) or an injection attack (False). - "summary" (str): A summary of the assessment, including the type of attack and explanation if applicable. |
Source code in safeguards/guardrails/injection/llm_guardrail.py
load_prompt_injection_survey()
Loads the prompt injection survey content from a markdown file, wraps it in
<research_paper>...</research_paper>
tags, and returns it as a string.
This function constructs the file path to the markdown file containing the
summarized research paper on prompt injection attacks. It reads the content
of the file, wraps it in
Returns:
Name | Type | Description |
---|---|---|
str |
str
|
The content of the prompt injection survey wrapped in |
Source code in safeguards/guardrails/injection/llm_guardrail.py
predict(prompt, **kwargs)
Predicts whether the given input prompt is a prompt injection attack.
This function formats the user and system prompts using the format_prompts
method,
which includes the content of research papers and the input prompt to be assessed.
It then uses the llm_model
to predict the nature of the input prompt by providing
the formatted prompts and expecting a response in the SurveyGuardrailResponse
format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prompt
|
str
|
The input prompt to be assessed for potential injection attacks. |
required |
**kwargs
|
Additional keyword arguments to be passed to the |
{}
|
Returns:
Type | Description |
---|---|
list[str]
|
list[str]: The parsed response from the model, indicating the assessment of the input prompt. |