If you’re building with AI, chances are you’ve run into issues and questions around tokens. They are at the core of all Large Language Models (LLMs), and are the building blocks of all prompts and responses.
Understanding tokenization—what it is, its significance, and its impact on API usage—is crucial for anyone integrating AI into their product, or even using apps like ChatGPT.
Let’s define some terms, and then run through some real-world examples to better understand how to optimize prompts and interactions with OpenAI’s APIs.
What are tokens?
A token is the basic unit of information that an LLM reads to understand and generate text. A token can be an individual character, a full word, or even part of a word, as you can see in the example below.
Why do models use tokens?
When you write a prompt, it is broken down into tokens immediately. This is because LLMs don’t interpret language the same way humans do. They are trained on a large amount of text data, which they use to learn patterns. Tokenization turns natural language into a format the model can work with.
How tokens are split depends on the language and specific tokenization process used.
What actually is tokenization?
Tokenization is the process of breaking text down into tokens, like how "Evergreen" was broken down into two tokens. In general, there are three major tokenization approaches; a quick sketch comparing them follows the list below.
1. Word-based - Each word is a token. Evergreen = 1 token.
- Advantages: You can include every word in a language.
- Disadvantages: The model's large vocabulary increases memory usage and time complexity.
2. Character-based - Each character is a token. Evergreen = 9 tokens.
- Advantages: A much smaller vocabulary, which simplifies memory usage and time complexity.
- Disadvantages: Difficulty learning meaningful context; “t” gives less context than “tree”.
3. Subword-based - Common subwords are treated as separate tokens. Evergreen = 2 tokens.
- Advantages: Covers a large vocabulary with fewer tokens by flexibly handling word forms.
- Disadvantages: Increased memory and time complexity compared to character-based.
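To make the differences concrete, here's a quick sketch in plain Python. The subword split shown is only illustrative; a real subword tokenizer learns its own splits from data.

```python
text = "Evergreen"

# Word-based: every whitespace-separated word is one token
word_tokens = text.split()          # ['Evergreen'] -> 1 token

# Character-based: every character is one token
char_tokens = list(text)            # ['E', 'v', 'e', ...] -> 9 tokens

# Subword-based: common fragments become tokens
# (illustrative split; a trained tokenizer learns its own merges)
subword_tokens = ["Ever", "green"]  # -> 2 tokens

for name, tokens in [("word", word_tokens),
                     ("character", char_tokens),
                     ("subword", subword_tokens)]:
    print(f"{name}-based: {tokens} ({len(tokens)} tokens)")
```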
OpenAI uses a type of subword-based tokenization called Byte Pair Encoding (BPE). This tokenization method is one of the main reasons these models have gotten so good at understanding language and generating nuanced responses.
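You can see BPE in action using OpenAI's open-source tiktoken library. Here's a minimal sketch (the sample sentence is arbitrary):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the BPE encoding used by models like gpt-3.5-turbo and gpt-4
enc = tiktoken.get_encoding("cl100k_base")

text = "Evergreen trees stay green all year."
token_ids = enc.encode(text)

print(token_ids)                             # the integer ID for each token
print([enc.decode([t]) for t in token_ids])  # the text fragment each token maps to
print(f"{len(token_ids)} tokens")
```

Decoding each ID individually is a handy way to build intuition for how your own prompts get split.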
Tokens and API usage
Tokens are one of the main driving factors related to cost, response time, and performance when using OpenAI’s APIs (and most other providers). Understanding how they relate can help you write better prompts, cut costs, and improve user experience.
OpenAI charges different rates for different models, and updates these figures often. In general, longer prompts require more tokens and cost more.
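As a rough sketch, you can estimate spend directly from token counts. The rates below are placeholders rather than current pricing, so swap in the figures from OpenAI's pricing page for your model:

```python
# Illustrative only: plug in the current per-1K-token rates for your model
PROMPT_RATE_PER_1K = 0.0015      # hypothetical input rate (USD per 1,000 tokens)
COMPLETION_RATE_PER_1K = 0.002   # hypothetical output rate (USD per 1,000 tokens)

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Rough cost estimate for a single request."""
    return (prompt_tokens / 1000) * PROMPT_RATE_PER_1K + \
           (completion_tokens / 1000) * COMPLETION_RATE_PER_1K

print(f"${estimate_cost(3500, 500):.4f}")  # e.g. a 3,500-token prompt + 500-token response
```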
Dealing with token limits
Along with cost, each model has a certain token limit. This limit covers the tokens used in the prompt AND the response. In practice, tokens from the system message are counted as well, though OpenAI hasn't explicitly documented this (yet).
If you are using a model with a maximum token limit of 4,000 and your prompt uses 3,500 tokens, then you only have 500 tokens left for the response. If the model tries to generate a completion longer than 500 tokens, the output will be cut short or the request may fail.
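One practical safeguard is to count your prompt's tokens before sending the request. Here's a sketch using tiktoken; the 4,000-token limit is just the example figure from above, and keep in mind chat messages add a few extra formatting tokens on top of this count:

```python
import tiktoken

CONTEXT_LIMIT = 4000  # example limit from above; use your model's actual limit

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Plan a week-long trip to Portugal..."  # your prompt text

prompt_tokens = len(enc.encode(prompt))
remaining = CONTEXT_LIMIT - prompt_tokens
print(f"{prompt_tokens} prompt tokens, {remaining} tokens left for the response")
```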
Although instructing the model in your prompt not to return an output over 500 characters may get you a response, you'll reduce the quality of that response. It goes against one of the main principles of prompt engineering: give the model room to think. If you’re interested in some of the other prompt engineering best practices, you can check out our recent article here: 10 Tips for Writing Better Prompts with Any Model.
The more tokens used in a request, the slower the API response time. Keep this in mind in the context of your application: if speedy responses matter more than detailed ones, you might want to lower the token limit.
Experimenting with different models and token limits takes just a few clicks in PromptHub, allowing for side-by-side comparisons that highlight the tangible differences when running prompts on a model with a lower token limit.
Token limits vs Max Tokens parameter
While the token limit for a model includes the system message, prompt and response, the Max Tokens parameter only relates to the tokens used in the response.
So if you need to constrain your response, you can test using this parameter.
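For example, with the OpenAI Python SDK you can cap the response length without touching your prompt (the model name below is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use whichever model you're testing
    messages=[
        {"role": "system", "content": "You are a helpful travel assistant."},
        {"role": "user", "content": "Plan a 3-day trip to Lisbon."},
    ],
    max_tokens=500,  # caps the response only, not the prompt
)

print(response.choices[0].message.content)
```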
For more info on how Max Tokens works, we did a deep dive on all things related to OpenAI parameters in our article Understanding OpenAI Parameters: Optimize your Prompts for Better Outputs.
How to get around token limits
This will be an article of its own in the future. For now, here's a high-level overview of the options.
Chunking - Divide your text input into smaller pieces that fit within the token limit, and send them as separate requests (see the sketch after this list). Have the model generate responses for each chunk and combine the results. You may lose some contextual coherence depending on your chunking boundaries.
Summarization - Condense the content before sending it to the model. Rather than sending every chapter of a book, you send a short summary of each chapter. Unfortunately, this method may cause some loss of detail and nuance.
Prioritization - Make sure the most important parts of your prompt fit within the token limit. This takes time and requires some tough decisions about what to keep.
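Here's a minimal chunking sketch using tiktoken. It splits on raw token boundaries, so in practice you'd likely want to break on paragraphs or sentences to better preserve context (the chunk size is arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens_per_chunk: int = 1000) -> list[str]:
    """Split text into chunks that each fit within a token budget."""
    token_ids = enc.encode(text)
    chunks = []
    for start in range(0, len(token_ids), max_tokens_per_chunk):
        chunk_ids = token_ids[start:start + max_tokens_per_chunk]
        chunks.append(enc.decode(chunk_ids))
    return chunks

# Each chunk can then be sent as its own request and the responses combined
```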
How to make sure you are optimizing usage of OpenAI’s APIs
To optimize your use of OpenAI’s APIs with regard to tokens, you need to consider a few things.
Monitor Token Usage: Keep tabs on the token count for your requests (prompts and responses); the API reports these counts with every response (see the sketch after this list). PromptHub allows you to view the token count for each request and even test it against different models and versions of your prompts.
Be Mindful of Prompt Length: While longer prompts tend to outperform shorter ones, it's still worth revisiting your prompts to see if they can be condensed. Focused prompts can often produce similar results at a fraction of the cost.
Break Text When Needed: Use chunking or any of the methods above when handling large text inputs that exceed the token limit threshold.
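The API response itself reports token usage for every request, which makes monitoring straightforward. Here's a quick sketch with the OpenAI Python SDK (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Summarize why tokens matter."}],
)

usage = response.usage
print(f"prompt tokens:     {usage.prompt_tokens}")
print(f"completion tokens: {usage.completion_tokens}")
print(f"total tokens:      {usage.total_tokens}")
```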
Experiment 1: Varying the Token Limit
Experiment time! Let's see how adjusting the token limit impacts the output of the same prompt and system message combination.
Version 1 will have a maximum token limit of 600, and Version 2 will have a limit of 1,500.
Responses
I ran this experiment in PromptHub, using our comparison tools to analyze the responses side by side. PromptHub’s GitHub-style versioning and difference-checking make A/B testing easy.
Version 1 (600 token limit) is on the left, and Version 2 (1,500 token limit) is on the right.
Analysis
Version 1 is more concise, a product of the lower token limit. Version 2 goes into more detail, providing more information on destination specifics (outdoor activities, cultural experiences, budget, etc.).
Both versions have some trade-offs. Version 1 is succinct, uses tokens efficiently, and processes the request twice as fast. Version 2 gives way more details and provides a richer conversation.
This shows the importance of experimenting with token limits (along with other parameters and models) to figure out the optimal setup for your use case. Sometimes short and sweet is better; sometimes you need more detail.
Takeaways
- Optimizing token usage is crucial for cost-effective and efficient use of OpenAI's models. Tokens directly influence both the cost of API usage and response time.
- Managing token limits is key to avoiding hitting the maximum and cutting off the model's ability to produce a complete response.
- Chunking requests, summarizing content, and writing effective prompts can help manage situations where token limits are exceeded.
- Experiment with different token limits to see the trade-offs between response length and depth.