Last week we began a discussion of the differences between using APIs backed by traditional business logic and those backed by Large Language Models (LLMs). Part 1 covered
Background: LLMs and how they differ from business logic
Tokens: What are LLM tokens and why do they matter?
Requests: Considerations when making requests to LLM APIs
If you missed it, I recommend reading the first part before continuing.
If you are already up to speed, here are some salient reminders.
A token is a common sequence of characters found in text. A standard estimate is that 100 tokens ~= 75 English words.
Also:
The nature of the underlying processing leads to some significant differences in behavior. LLM APIs:
Are much slower to return results
Are more expensive, either in direct cost or processing power
Have significant variance in output when given the same input
And also:
A few caveats about this post:
When I say API, I’m referring specifically to REST APIs. There are other API approaches and the principles will be similar.
I will be referencing OpenAI’s GPT models and Anthropic’s Claude models since those are the 3rd party LLM API providers I have the most experience with.
Specific numbers are accurate as of writing in August 2023, but may quickly become outdated as new models and approaches emerge.
Alright! Let’s pick up where we left off.
Response Time
We’re accustomed to logic-based API calls whose response time is measured in 100s of milliseconds. An API call that takes several seconds is considered slow.
Non-trivial LLM generation time is measured in 10s of seconds. If you’ve used ChatGPT, you’ve seen how long it can take to write a page-long reply. Generation time scales with output length and can vary wildly from call to call.
This has several implications for clients that rely on LLM generation.
User Workflow
With logic-based APIs, we’ve learned to never block the UI on an API call. You want asynchronous handling so the application is responsive, the user can cancel, etc. With LLM-based APIs, you have to take this further and avoid blocking not just the user’s UI but the user’s workflow.
If a user’s workflow is blocked by a loading spinner that lasts a few seconds, they will be annoyed at the interruption but continue. But a 30-second spinner? They are moving away from your client to do something else.
Reduce Generation Time
Even if you keep the workflow unblocked, you should also strive to keep the generation time as low as possible. There are a few techniques you can try depending on the nature of your application:
If you will be generating a lot of content, prefer multiple short prompts to a single long prompt.
If these prompts don’t depend on each other, parallelize the requests.
Specify the output format in the prompt. Ask for the response in a strict format (JSON works well), suggest a target word count, or simply ask it to “be concise” (which can be surprisingly effective).
Reduce Perception of Time
There are limits to how much control you have over the generation time, so you should also work to reduce the user’s perception of time. UI/UX designers have built up quite the repertoire of techniques for this.
Consider ChatGPT itself. It doesn’t wait for the entire generation before handing you some response, it provides it line by line, giving the illusion of typing back to you. This is not only a better experience, but it actually makes the overall generation type feel shorter. If this fits your use case, OpenAI allows you to do the same using server-sent events.
Manage Expectations
Chances are that your users are also accustomed to quick responses. Once you’ve done all you can to keep the workflow unblocked, reduce the generation time, and reduce the perception of generation time, you may still be running longer than your users expect.
At this point, all you can do is try to manage your users’ expectations. Make it clear which actions can be expected to take some time. Remind them, subtly, how much time they are saving in the long run by waiting a little now. They’ll forgive you if they know it is worth it.
Response Content
Considerations related to the generated content.
Randomness
Logic-based APIs appear deterministic - given the same inputs, you can expect to receive the same outputs. The reality is more complex, of course, but from the user’s point of view taking identical actions should have identical results.
LLM-based APIs appear nondeterministic - given the same inputs, you can expect to receive different outputs. Every time. If you don’t account for this in your design then users may see your client as inconsistent and untrustworthy.
One simple way to account for this is to turn it into a feature. Add a Retry button and make it clear to the user that they can get different versions of whatever result.
Another option is to use the LLM’s temperature setting to adjust the level of randomness. You can even set it to zero to minimize the randomness (at the expense of the creativity that can make LLMs so powerful).
Inaccuracy
Logic-based APIs return what you’ve told them to return. Barring bugs or bad data, you can trust the results to be accurate.
LLMs are known to hallucinate, which means they will make things up. Nonexistent API definitions, historical events, court cases, etc. They are designed to respond to you, and if they don’t have an answer they will tell you something that sounds like an answer.
If you can, validate the response. If you can’t, well… as far as I know, no one has a solution for this yet. Back to managing user expectations.
Prompt Injection
We’ve long dealt with various code injection attacks in our application, the most famous of which are probably SQL injection and cross-site scripting (XSS).
LLMs introduce a fun new one: prompt injection. If you take user-provided text and include it as part of your prompt, then a malicious user can override the expected result. If you haven’t tried this yourself, Gandalf is a fun introduction to the idea.
Mitigations exist (and are evolving quickly and thus left as an exercise for the reader), but there are no solutions. You must assume that a user can make the LLM return whatever they want. Your client must be robust against wildly unexpected responses.
Do not crash if the LLM returns garbage
Do not append generated HTML directly to the DOM
Do not use returned values to make programmatic decisions without validation
Avoid putting anything in your prompt that you can’t have mirrored back to the user
What a wild ride
LLMs can do things that we barely imagined scant years ago, with the ecosystem continuing to evolve extremely rapidly. Even those who are devoted full-time to LLM matters cannot keep up with the pace of change.
But as powerful as they are, LLMs can be slow, expensive, and unpredictable to operationalize. If you are going to rely on them in your applications, take care to design around their many quirks.
Of course, within these problems lies great opportunity. If you can solve for the downsides of any of these LLM quirks, there is a huge market just waiting for you.


Regarding managing expectations...
https://www.theguardian.com/world/2023/aug/10/pak-n-save-savey-meal-bot-ai-app-malfunction-recipes
"Please be aware that suggested recipes might not be fit for human consumption. Kids, don't try these at home."