The images you see above were entirely generated by Artificial Intelligence (AI), thanks to a famous text2image model that can create images from user-defined text. This text is called a “prompt”, and its quality influences the quality of the image. Text input to AI models has recently become mainstream thanks to art generation models such as DALLE2 and Large Language Models (LLMs) such as GPT4 and the famous ChatGPT.
Prompts enforce rules and ensure specific qualities (and quantities) in the generated output. But what does “high-quality prompt” mean? Let’s continue with DALLE2, because image quality is easier to judge at first glance than text quality.
Using the prompt “girl with long red hair, digital art” we got the following image:
While amazing for something generated by an AI model, it would be a rather mediocre result if done by a human. Only with a more effective prompt is it possible to exploit the truly great potential of DALLE2 and other generative models.
For example, using the open text2image model Stable Diffusion with the long, specific prompt “Aurora, girl with super long red hair, hair becoming bright stars, intricate, highly detailed, digital painting, Artstation, concept art, smooth, sharp focus, illustration, unreal engine 5, 8 k, art by Artgerm and Greg Rutkowski and Alphonse Mucha” we got the following result, which is definitely of higher quality:
We define “Prompt Engineering” as a concept in Natural Language Processing (NLP) that involves discovering and using text inputs that produce results a human user finds useful and desirable.
These results can be:
– Images generated from a textual description of the desired visual output
– Text answers to the user’s questions
Why use Prompt Engineering?
Large Language Models use natural language to communicate with us (e.g., by answering with textual output), but this doesn’t mean they reason like humans. To be honest, they don’t think at all: put somewhat simplistically, an LLM can be defined as a statistical model that generates each word by selecting the most probable one given the words it has already generated. Which word is most probable depends on the texts the model was trained on, which can involve vast amounts of data, often exceeding that of the entire Wikipedia website.
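To make the “most probable next word” idea concrete, here is a minimal sketch in Python. It is not how a real LLM works internally (those are neural networks that look at far more than the last word); it is a toy model that only counts which word follows which in a tiny made-up corpus. The corpus and the function names `train_bigram_model` and `generate` are assumptions made up for this illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up "training set"; a real model sees billions of words.
CORPUS = (
    "the cat sat on the mat . "
    "the cat ate the fish . "
    "the cat sat on the sofa ."
)

def train_bigram_model(text):
    """Count, for every word, how often each other word follows it."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def generate(model, start, length):
    """Greedily extend `start` by always picking the most frequent follower."""
    output = [start]
    while len(output) < length:
        followers = model.get(output[-1])
        if not followers:
            break  # the last word was never seen in training
        next_word, _ = followers.most_common(1)[0]
        output.append(next_word)
    return output

model = train_bigram_model(CORPUS)
print(" ".join(generate(model, "the", 4)))  # prints "the cat sat on"
```

The toy model has no idea what a cat is: it answers “cat” after “the” only because that pair was the most frequent one in its training text.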
Even if it replies in human language, it’s obviously not human. It is as if it were another species that can speak our language. For instance, let’s imagine that ChatGPT is a genie:
The poor child didn’t use prompt engineering with the genie and got horrible results! What was his mistake? He thought the genie had as much common sense as he did and that his demands were clear. But a genie is not a human.
A similar metaphor was also used by Jessica Shien, AI strategist at OpenAI, who recommends using the “genius in a room” mental model.
The “genius in a room” model assumes that the AI model (in our metaphor: the genie) doesn’t know anything about you other than what you write on a piece of paper and slide under the door (the textual prompt). Once you can visualize this, you get a more realistic idea of what an LLM like ChatGPT is capable of and what it requires to achieve an accurate result in a reply. Using this mental model, it becomes obvious that the more context you provide the “genius”, the better the answers you will get from it.
Using prompt engineering correctly allows the user to guide the LLM towards the results he or she wants to achieve. We can imagine the prompt as a slide that makes it easier to reach the goal.
Conversely, if the user lacks an understanding of prompt engineering or opts not to employ it, a mountain stands between them and their goal, making it a challenging climb to achieve the desired outputs. It could still be possible: we must remember that LLMs and text2image models are probabilistic. In theory, we can achieve nice results even without prompt engineering, but it is harder and less likely. Depending on the domain, it could even be impossible.
Prompt Engineering: aspects to consider
It is important to stress that there are different ways to do prompt engineering.
To generate the redhead girl image above, the prompt included a significant amount of information about the girl’s appearance and style, together with numerous relevant keywords. In prompt engineering, keywords are particular words that trigger a certain type of result because of how the model was trained.
For instance, Stable Diffusion was trained on many beautiful Artstation images with this tag, so “artstation” is a keyword that triggers that kind of image. However, this is not the only way to communicate with an LLM. For correct prompt engineering, we must consider many aspects.
- Model type
The first and most immediate aspect is the type of model. For instance, DALLE2 generates an image starting only from a single user prompt, while ChatGPT is conversational and each chat carries its own context. This context allows the LLM to “remember”, within some limit, the user’s questions and its own answers in that conversation.
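The idea of a chat that “remembers” within some limit can be sketched as a list of messages that is trimmed when it grows too long. This is only an illustration under stated assumptions: the `Conversation` class is hypothetical, no real model is called, and the limit here counts messages, whereas real chat models typically limit the amount of text they can take in.

```python
class Conversation:
    """Hypothetical chat history that keeps only the latest messages."""

    def __init__(self, max_messages=6):
        self.max_messages = max_messages
        self.messages = []

    def add(self, role, text):
        """Store a message and forget the oldest ones beyond the limit."""
        self.messages.append({"role": role, "content": text})
        self.messages = self.messages[-self.max_messages:]

    def as_prompt(self):
        """Text the model would actually see for its next answer."""
        return "\n".join(f"{m['role']}: {m['content']}" for m in self.messages)


chat = Conversation(max_messages=2)
chat.add("user", "My name is Anna.")
chat.add("assistant", "Nice to meet you, Anna!")
chat.add("user", "What is my name?")
# Only the two most recent messages are left: the one with the name is gone.
print(chat.as_prompt())
```

With a limit this small, the message containing the name has already been dropped, so the “model” would have nothing to base its answer on. This is the “genius in a room” situation again: what is not on the paper does not exist.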
- Model’s training data
The second aspect is the data on which that particular model was trained. Typically, big corporations don’t disclose this kind of information, so the data they used can only be known in part. For instance, ChatGPT was reportedly trained on a massive 570 GB of data from the internet, sourced from books, Wikipedia, research articles, webtexts, websites and other forms of content and writing on the net. Approximately 300 billion words were fed into the system.
- Goal
The third aspect is our goal. Beyond the specific domain, the type of goal we want to achieve determines the format of the prompt. Classic examples of goal types are:
– Output customization: create a code snippet in a specific programming language and/or paradigm, or generate a mail template with a certain tone and/or for a certain recipient
– Information retrieval: ask for information about some concept
– Refinement: get from the LLM an improved version of the user’s question
Thus, it becomes evident that prompt engineering rules, both existing and emerging, must be tailored to the elements described earlier. Depending on the specific combination of these elements, it’s necessary to determine the appropriate “dialect” and prompt format that best interfaces with the model.
How can we do Prompt Engineering? A ChatGPT example
A classical approach is to study prompt engineering and then try various reformulations of the same prompt on the target LLM. In this way, with a few well-aimed attempts that are similar to each other but not identical, it is possible to obtain the desired results more easily.
To understand the difference that Prompt Engineering makes, compare the results of the following ChatGPT conversations:
In the first case, I used a generic prompt and indeed I got a generic answer. In the second picture, we can notice the use of some basic Prompt Engineering notions:
– Explain the problem we want the LLM to solve (e.g. explain Computer Vision focusing on business effects)
– Indicate format (e.g. length) and style
– Give information and knowledge (e.g. a correct context) if needed
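The three notions above can be turned into a small helper that assembles a prompt from its parts. This is a sketch, not a standard recipe: the function `build_prompt`, its parameters, and the “Context / Task / Format / Style” labels are assumptions chosen for this illustration, and the example context is invented.

```python
def build_prompt(task, output_format=None, style=None, context=None):
    """Assemble a prompt from the problem, the desired format/style and context."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    parts.append(f"Task: {task}")
    if output_format:
        parts.append(f"Format: {output_format}")
    if style:
        parts.append(f"Style: {style}")
    return "\n".join(parts)


# Generic prompt: only the task is given.
generic = build_prompt("Explain Computer Vision")

# Engineered prompt: problem, format, style and context are all spelled out.
engineered = build_prompt(
    task="Explain Computer Vision focusing on business effects",
    output_format="at most 100 words",
    style="plain language, no jargon",
    context="The reader is a manager with no technical background.",  # invented example
)

print(generic)
print("---")
print(engineered)
```

The second prompt is what we would slide under the door of the “genius in a room”: it states what we want, in what shape, and for whom.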
In conclusion, to fully leverage the potential of an LLM, one must engage in prompt engineering and adopt the appropriate mindset when interfacing with the model. Therefore, just as we pay attention to our words when talking to someone, we should also pay attention to the words we use in prompts, remembering the “genius in a room” approach.
Otherwise, even with the most revolutionary tools, the basic lesson is always the same: garbage in, garbage out.