In a recent study from IBM, researchers analyzed prompt editing sessions in an enterprise environment, drawing on a dataset of 1,712 users. The findings were published in Exploring Prompt Engineering Practices in the Enterprise.
It's a real-world glimpse into the habits and practices of people working on prompts.
There are plenty of other interesting details nested in the paper, like which prompt components changed the least, how often parameters were tweaked, and a few findings that challenge common assumptions.
Let's dive in.
Methodology Overview
A few quick notes on how the data was collected:
- The researchers gathered data from 1,712 users and narrowed the dataset down to 57 selected users for analysis.
- The tool they collected usage data from sounds similar to PromptHub: a UI where users can write prompts, view outputs and modify prompt parameters.
- While the data was anonymized, the tool was open to internal use across the whole company, so the usage could include users with varying levels of prompt engineering experience.
- The data was sampled to be diverse in terms of user activity, ranging from users with fewer than four prompts to those with over thirty.
Overall, the analysis focused on identifying prompt editing patterns and classifying the iterative changes made over time.
Qualitative analysis and results
The team used a tool similar to what we have in PromptHub to diff pairs of successive prompts and label the different types of edits.
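As a rough illustration (not the researchers' actual tooling), here's a minimal Python sketch of diffing two successive prompt versions with the standard-library difflib module to see what changed between iterations; the prompt text is made up:

```python
import difflib

prompt_v1 = """You are a support assistant.
Answer the customer's question using the context below.
Context: {context}"""

prompt_v2 = """You are a support assistant.
Answer the customer's question using the context below.
If you don't know the answer, respond with "I'm not sure."
Context: {context}"""

# Line-level diff between two successive prompt versions,
# similar in spirit to how edits could be spotted and labeled.
diff = difflib.unified_diff(
    prompt_v1.splitlines(),
    prompt_v2.splitlines(),
    fromfile="prompt_v1",
    tofile="prompt_v2",
    lineterm="",
)
print("\n".join(diff))
```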
Session length
Let's dive into some graphs, starting with session length.
The average prompt editing session was 43.3 minutes.
The time between one prompt version and the next was ~50 seconds, highlighting just how iterative this process is.
Prompt variant similarity
To get a better understanding of how much a prompt changed from one version to the next, the researchers calculated the sequence similarity between prompt iterations.
As you can see in the graph above, the similarity ratios skew to the right. This means that the prompt edits being made are relatively small, again, pointing to the iterative nature of prompt engineering.
A similarity ratio of 1 means that the prompt didn’t change at all. In these cases, the users were usually tweaking the model or other parameters. Holding the prompt constant while tweaking parameters is an ideal way to test variants: by changing only one variable at a time, you can better understand its effects.
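The paper doesn't spell out the exact similarity metric, but to get a feel for what a ratio near 1 looks like, here's a minimal sketch using Python's difflib.SequenceMatcher; the example prompts are made up:

```python
import difflib

def similarity(prompt_a: str, prompt_b: str) -> float:
    """Similarity ratio between 0 (completely different) and 1 (identical)."""
    return difflib.SequenceMatcher(None, prompt_a, prompt_b).ratio()

v1 = "Summarize the following article in three bullet points."
v2 = "Summarize the following article in three concise bullet points."

print(similarity(v1, v2))  # close to 1.0: a small, iterative edit
print(similarity(v1, v1))  # exactly 1.0: prompt unchanged (likely just a parameter tweak)
```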
Parameter changes
Speaking of parameter changes, 93% of sessions eventually involved a parameter/model tweak of some sort. The most popular change was testing different models. I was surprised that temperature wasn’t in the top three.
This graph shows how often each parameter was changed, as a percentage of the sessions that included at least one parameter change.
For example, of the sessions that included a parameter change, 75% included a model change.
Given the frequency of model changes, the researchers dove deeper and analyzed how many models were used within a single prompting session.
The average number of models tested in a session was 3.6. This underscores the importance of being able to easily test different models. In PromptHub we make it easy to test across a variety of models, all in a single view.
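For a sense of what that looks like programmatically, here's a minimal sketch that runs the same prompt against several models; generate() and the model names are placeholders rather than any specific provider's API:

```python
# Placeholder model call: swap in your provider's SDK here.
def generate(model: str, prompt: str, temperature: float = 0.0) -> str:
    return f"[{model} @ temperature={temperature}] response to: {prompt[:40]}..."

# Hypothetical model names; the study saw an average of 3.6 models per session.
MODELS = ["model-a", "model-b", "model-c"]

def compare_models(prompt: str) -> dict:
    """Run the exact same prompt and parameters against each model."""
    return {model: generate(model, prompt, temperature=0.0) for model in MODELS}

results = compare_models("Summarize the following article in three bullet points: ...")
for model, output in results.items():
    print(model, "->", output)
```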
Prompt component classification
The researchers broke down prompt edits into a few categories: modified, added, changed, removed, formatted and other.
Modifying a prompt is akin to making minor tweaks. This was the most common edit type, followed by adding on additional text to the prompt.
Prompts were broken down into components to allow for a more detailed analysis of edits.
Prompt component edits
Context was by far the most edited component. Context usually takes the form of examples or grounding data that help guide the model through a task. Few-shot prompting and simulating dialog are two common use cases.
Here’s the interesting part about context.
The researchers found that users would start with a particular context and refine the instruction and overall prompt until they were (presumably) happy with the output. After that, to make sure the prompt still worked with different context, they would swap out the context and see how the outputs looked.
This type of testing is key to ensuring your prompt performs in a variety of circumstances. It's easy to do in PromptHub via batch testing and datasets: bring over a CSV of data and, with a few clicks, run your prompt over all the rows.
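Conceptually, a batch test is just running the same prompt template over every row of a dataset. Here's a minimal sketch with Python's csv module; run_prompt(), the file name, and the column names are placeholders, not PromptHub's implementation:

```python
import csv

# Placeholder model call: swap in a real request to your model of choice.
def run_prompt(prompt: str) -> str:
    return f"[model output for] {prompt[:60]}..."

PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "Context: {context}\n"
    "Question: {question}"
)

# Assumes a CSV with 'context' and 'question' columns; each row is one test case.
with open("test_cases.csv", newline="") as f:
    for row in csv.DictReader(f):
        prompt = PROMPT_TEMPLATE.format(context=row["context"], question=row["question"])
        print(run_prompt(prompt))
```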
Multi-edits
22% of all edits were multi-edits, where a user made more than one edit before running the next prompt. When multiple edits were made, the average was two edits. Not super high, but in theory it would make more sense to make one edit at a time to isolate changes.
But if you are changing some of the context, you may need to tweak labels to account for the new context. In that case, while multiple components of the prompt are edited, they are all related and can almost be treated as a single component.
The research backs this up. 68% of sessions with multiple edits included at least one context edit. Almost half included a context edit + an instruction edit.
Rollbacks
A key part of the prompt editing process is being able to roll back to previous versions. This was one of the first features we built into PromptHub.
11% of all prompt changes were rollbacks. Certain prompt components were much more likely to be rolled back.
- 40% of instruction:handle-unknown edits (“if you don’t know, respond with”) were rolled back
- 25% of instruction:output-length edits were rolled back
- 24% of label edits and 18% of persona edits were rolled back
Comparatively, edits to other components like instruction and task were rolled back much less frequently (8-9%).
With this in mind, let's look back at the frequency graph above (the one with the yellow bars). The components with higher rollback rates were also edited less frequently. Users probably saw that editing these components led to worse results, rolled back the changes, and then edited them less often.
Wrapping up
This is one of the larger studies of prompt engineering happening in the real world. A lot of what we see and practice ourselves aligned with the results, like how prompt engineering sessions are relatively long and require iteration across various variables. But there were also some surprising findings, like context edits outnumbering instruction edits.
With PromptHub, we hope we can help teams make this process easier. If you’re interested in trying out PromptHub, let me know.