Lately I’ve been spending a lot of time in my day job thinking about how we should build products and features that rely on Large Language Model (LLM) text generation. One aspect I find particularly interesting is the differences between using APIs backed by LLM generation and APIs backed by traditional business logic.
I wrote up some guidance for our application developers and figured it might be useful to talk about here as well.
As sometimes happens once I get writing, this post turned out much longer than I expected. I try to keep these posts readable in a short sitting, so I have split this one into two parts.
“I have made this longer than usual because I have not had time to make it shorter.” - Blaise Pascal
A few caveats about this post:
When I say API, I’m referring specifically to REST APIs. There are other API approaches, but the principles will be similar.
I will be referencing OpenAI’s GPT models and Anthropic’s Claude models, since those are the third-party LLM API providers I have the most experience with.
Specific numbers are accurate as of writing in August 2023, but may quickly become outdated as new models and approaches emerge.
Background
LLM APIs leverage sophisticated machine learning algorithms to generate human-like text, providing capabilities ranging from simple text completion to complex content creation.
Logic-based APIs rely on predefined rules and traditional programming logic to process requests and return data, offering structured, predictable outcomes.
The nature of the underlying processing leads to some significant differences in behavior. LLM APIs:
Are much slower to return results
Are more expensive, either in direct cost or processing power
Have significant variance in output when given the same input
The value of LLMs is that they can solve problems that cannot be solved by business logic, either in the interpretation of the input or the generation of the output.
We’ll consider some of these differences in more detail, but it will be helpful if we first have a basic understanding of tokens.
Tokens
LLMs count both their inputs and outputs in tokens, which are described by OpenAI as “common sequences of characters found in text”.
A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).
Token counts are closely tied to the cost, runtime, and limitations of specific models. As such, it would be great if we could know in advance how many tokens a request will use. But we can’t! There are libraries we can use to measure the token count of a given input (sketched below), but we have limited control over the length of the generated responses.
To make tokens even weirder to work with, tokenizers tend to be optimized for English text, which means interacting in another language may require more tokens for the same amount of text.
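For OpenAI’s models, the tiktoken library can measure how many tokens a given input will use (Claude uses a different tokenizer, so the counts won’t transfer exactly). A minimal sketch:

```python
# Count the tokens a piece of text will use for a given OpenAI model.
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

print(count_tokens("LLMs count both their inputs and outputs in tokens."))
```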
With that in mind, what should we consider when writing LLM-based features?
Request
Considerations relating to making requests.
Input Size
LLMs have a limit to how many tokens they can process in a given request. This is referred to as the context window. The context window varies dramatically from model to model. The cheaper version of GPT-3.5 has a 4k token limit, while Claude has a 100k token limit.
So what happens if you hit the end of the context window? You receive a partial response with an indicator that the length was overrun. Well-designed logic-based APIs will either succeed or fail. Supporting the partial success state of LLMs introduces additional complexity.
When building a client, you need to make an intentional decision about how to treat a partial success. If you are building a chatbot, there may be no difference between partial success and complete success: you return the content to the user and move on. If you are expecting the LLM to generate content in a specific format, like JSON or HTML, then a partial response might be considered an unparseable failure.
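As a concrete sketch, here is how a client expecting JSON might treat a truncated response as a failure, using the OpenAI Python SDK’s Chat Completions call as it looked in mid-2023 (the finish_reason field reports whether the output was cut off):

```python
# Treat a truncated ("length") response as a failure when we expect valid JSON.
import json
import openai

def generate_json(prompt: str) -> dict:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    choice = response["choices"][0]
    if choice["finish_reason"] == "length":
        # The output hit the token limit; the JSON is almost certainly cut off.
        raise ValueError("partial response: output was truncated")
    return json.loads(choice["message"]["content"])
```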
You should also ensure that the inputs you provide and outputs you request will fit well within the context window during normal usage.
Rate Limiting
Although rate limiting is common for APIs of any kind, the limits tend to be much stricter for LLMs. The limits may be spread across multiple axes: requests per day, requests per minute, and/or tokens per minute. The limits may also be variable.
OpenAI’s rate limits page states
During the rollout of GPT-4, the model will have more aggressive rate limits to keep up with demand. You can view your current rate limits in the rate limits section of the account page. We are unable to accommodate requests for rate limit increases due to capacity constraints.
Anthropic’s rate limits page states
Once you’re ready to go live we’ll discuss the appropriate rate limit with you.
When building a client, it is vital that you respect the LLM rate limits and handle 429 errors appropriately. Consider applying your own rate limits along relevant axes (e.g. by user, IP address, etc.) to keep one prolific user from taking down your application for everybody.
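Here is a minimal sketch of that retry behavior, assuming the (pre-1.0) OpenAI Python SDK, which raises RateLimitError when you hit a 429:

```python
# Retry rate-limited calls with exponential backoff plus jitter.
import random
import time

import openai
from openai.error import RateLimitError

def chat_with_backoff(messages, model="gpt-3.5-turbo", max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(model=model, messages=messages)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up and surface the error
            # Jitter keeps concurrent workers from retrying in lockstep.
            time.sleep(delay + random.uniform(0, delay))
            delay *= 2
```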
Cost
LLM APIs charge by the number of tokens used in the input and output, with output tokens typically costing one to four times as much as input tokens. The rates differ by model, with more powerful models and larger context windows being more expensive per token.
If you make calls with small inputs and outputs against weaker models with smaller context windows, you will pay minuscule fractions of a cent for each call. The worst case degenerates quickly, though. Consider GPT-4 with the 32k context window, at a rate of $0.06 / 1K input tokens and $0.12 / 1K output tokens. If an API call sends 2k input tokens and the response maxes out at 30k output tokens, that single call costs $3.72. That’s an expensive API call!
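To make the math reusable, here is that calculation as a tiny helper, using the GPT-4 32k rates quoted above (swap in your own model’s rates):

```python
# Back-of-the-envelope cost estimate for a single call at GPT-4 32k rates.
INPUT_RATE_PER_1K = 0.06   # dollars per 1,000 input tokens
OUTPUT_RATE_PER_1K = 0.12  # dollars per 1,000 output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * INPUT_RATE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_RATE_PER_1K

print(estimate_cost(2_000, 30_000))  # 3.72 -- the worst case above
```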
Now some logic-based API calls can also be expensive, but those are rare exceptions. We are accustomed to thinking of API calls as essentially free.
When building a client:
Avoid sending unnecessary content where possible
Make an attempt to control the output size in the prompt
Make use of traditional caching approaches to avoid repeating calls unless you really need to (see the sketch after this list).
If it is acceptable for similar input to lead to the same output, consider using fuzzy caching.
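Here is a minimal sketch of the traditional caching idea: key the cache on the exact prompt so identical requests never hit the LLM twice (call_llm is a hypothetical stand-in for whatever provider client you use):

```python
# Cache LLM responses keyed on the exact prompt text.
import hashlib

_cache: dict[str, str] = {}

def cached_completion(prompt: str, call_llm) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)  # only pay for genuinely new prompts
    return _cache[key]
```

This only helps when inputs repeat exactly; matching merely similar inputs is the fuzzy caching mentioned above.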
P.S. If anyone is aware of a decent library/service for fuzzy caching, let me know! I think I know how I’d build this but would rather use something that already exists.
Denial of Service (DoS)
It is worth noting specifically that the rate limits and costs of LLM APIs lend themselves to DoS attacks. You may be familiar with Distributed Denial of Service (DDoS), in which large amounts of traffic coming from many different sources overload a system.
Some other DoS variants include:
Model Denial of Service (MDoS): Your systems can handle the malicious traffic just fine, but the requests cause you to hit the LLM provider’s rate limits. This causes your app to lose LLM functionality for everyone until the rate limit resets.
Economic Denial of Sustainability (EDoS): Your systems stay up, but the malicious traffic maxes out your LLM bill, draining your financial resources.
Consider these attacks when planning your rate-limiting and monitoring strategies.
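As a starting point, here is a sketch of a per-user sliding-window limiter that rejects requests before they ever reach the LLM. In production you would likely back this with Redis or your API gateway rather than in-process memory, and the per-user budget here is just an assumed example:

```python
# Allow at most MAX_REQUESTS_PER_WINDOW LLM calls per user per minute.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 10  # example budget; tune for your application

_recent: dict[str, deque] = defaultdict(deque)

def allow_request(user_id: str) -> bool:
    now = time.time()
    window = _recent[user_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # drop requests that fell outside the window
    if len(window) >= MAX_REQUESTS_PER_WINDOW:
        return False  # reject (or queue) before spending any tokens
    window.append(now)
    return True
```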
To be continued
Next week Part 2 will pick up where we left off with considerations when handling LLM responses, as well as some high-level takeaways.
LLMs are new, and I still have a lot to learn about how to use them well. If you think I’ve gotten something wrong here, or have a question about how something works, please let me know in a comment!