Prompt caching: 10x cheaper LLM tokens, but how?
🔗 a linked post to ngrok.com »
What's going on in those vast oceans of GPUs that enables providers to give you a 10x discount on input tokens? What are they saving between requests? It's not a case of saving the response and re-using it if the same prompt is sent again; it's easy to verify through the API that this isn't happening. Write a prompt, send it a dozen times, and notice that you get a different response each time even when the usage section shows cached input tokens.
Not satisfied with the answers in the vendor documentation, which do a good job of explaining how to use prompt caching but sidestep the question of what is actually being cached, I decided to go deeper. I went down the rabbit hole of how LLMs work until I understood precisely what data the providers cache, what it's used for, and how it makes everything faster and cheaper for everyone.
After reading the Joan Westenberg article I posted yesterday, I decided I’m going to read more technical articles and focus my attention on them.
This post from the ngrok blog was very helpful in explaining how LLMs work up through the attention phase, which is where prompt caching happens.
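If I'm reading it right, what gets held onto between requests is the key and value tensors that the attention layers compute for the prompt prefix, so an identical prefix doesn't have to be re-processed. Here's a toy NumPy sketch of that idea as I understood it; the function and variable names are mine, not ngrok's or any provider's, and real systems do this per layer and per head with causal masking on GPUs:

```python
import numpy as np

d = 8                     # toy embedding size
rng = np.random.default_rng(0)

# One toy attention head's projection matrices.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# Stand-in embeddings: one random vector per distinct token.
_vocab: dict[str, np.ndarray] = {}

def embed(tokens: list[str]) -> np.ndarray:
    for t in tokens:
        _vocab.setdefault(t, rng.standard_normal(d))
    return np.stack([_vocab[t] for t in tokens])

# "Prompt cache": prompt prefix -> (K, V) matrices already computed for it.
kv_cache: dict[str, tuple[np.ndarray, np.ndarray]] = {}

def attend(prefix: list[str], new_tokens: list[str]) -> np.ndarray:
    key = " ".join(prefix)
    if key in kv_cache:
        K, V = kv_cache[key]              # cache hit: skip recomputing the prefix
    else:
        X = embed(prefix)
        K, V = X @ W_k, X @ W_v           # cache miss: compute and remember
        kv_cache[key] = (K, V)

    # Only the new tokens need fresh queries, keys, and values.
    X_new = embed(new_tokens)
    K = np.vstack([K, X_new @ W_k])
    V = np.vstack([V, X_new @ W_v])
    Q = X_new @ W_q

    # Dot products of each query with every key, then a softmax over them.
    # (Causal masking left out to keep the sketch short.)
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V                    # attention output for the new tokens

prefix = ["You", "are", "a", "helpful", "assistant"]
attend(prefix, ["Hello"])        # computes and caches K, V for the prefix
attend(prefix, ["Hi", "there"])  # reuses the cached K, V
```

The responses you get back can still differ every time; the only thing being reused is the expensive per-token key/value work for a prefix the provider has already seen.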
The ngrok post also sent me down a rabbit hole to remind myself how matrix multiplication works. I haven’t heard the phrase “dot product” since high school.
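For anyone else who needs the same refresher: the dot product multiplies two vectors element by element and sums the results, and a matrix multiplication is just a grid of dot products, one for each row–column pair. A quick sanity check in plain Python, nothing LLM-specific:

```python
# Dot product: multiply matching elements, then sum.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Matrix multiplication: entry (i, j) is the dot product
# of row i of A with column j of B.
def matmul(A, B):
    cols_B = list(zip(*B))  # columns of B, each as a tuple
    return [[dot(row, col) for col in cols_B] for row in A]

A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
print(matmul(A, B))               # [[19, 22], [43, 50]]
```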