Weekly Check-in: 2/16
Explorations on generative AI
For the sake of accountability, I'm trying to regularly write check-ins on my work in progress.
See all check-ins here.
[Image: Small robot holding pen, landscape painting, Stable Diffusion 1.5]
Lately I've been thinking (along with everyone else, I'm sure) about the phenomenon that is ChatGPT, and the underlying LLM technology behind it. I've been talking to teams, exploring all of the different language models, and trying to understand where it's all headed.
Along the way, I wrote about five LLM-adjacent ideas that I'm sharing with the world in case they spark anyone's imagination.
👀 Monitoring for LLMs (newrelic for LLMs)
Problem: deploying large language models in your apps adds a new layer of complexity that traditional APMs like NewRelic won’t be good at analyzing. Teams will want visibility into how these models are performing.
Solution:
- tracking all API calls, latency, responses, token count, estimated costs
- being able to figure out how often calls are failing
- being able to figure out which queries are costing you the most money
- being able to figure out which queries are the slowest, p90 and p95 latency
- being able to figure out which responses are likely poor quality (too short, too generic)
- detecting prompts & completions that might be harmful / nsfw
- possibly adding some money-saving measures such as caching, token re-writing
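As a rough illustration of the tracking piece, here's a minimal sketch of a wrapper that logs latency, token counts, failures, and estimated cost per call. It assumes a dict-like response with an OpenAI-style "usage" field; the price table, logger setup, and function names are placeholders, not a real product.

```python
# Rough sketch of per-call LLM monitoring, assuming an OpenAI-style API whose
# responses are dict-like and include a "usage" object with token counts.
# The price table and log destination are illustrative placeholders.
import time
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_monitor")

PRICE_PER_1K_TOKENS = {"text-davinci-003": 0.02}  # assumed pricing; varies by model

def monitored_completion(client_call, model, prompt, **kwargs):
    """Wrap one LLM API call and record latency, token usage, estimated cost, and failures."""
    start = time.perf_counter()
    try:
        response = client_call(model=model, prompt=prompt, **kwargs)
    except Exception as exc:
        log.error("llm_call_failed model=%s error=%s", model, exc)
        raise
    latency_ms = (time.perf_counter() - start) * 1000
    usage = response.get("usage", {})
    est_cost = usage.get("total_tokens", 0) / 1000 * PRICE_PER_1K_TOKENS.get(model, 0)
    log.info(
        "llm_call model=%s latency_ms=%.0f prompt_tokens=%s completion_tokens=%s est_cost_usd=%.5f",
        model, latency_ms, usage.get("prompt_tokens"), usage.get("completion_tokens"), est_cost,
    )
    return response
```

Aggregating log lines like these is what would get you failure rates, cost-per-query, and p90/p95 latency downstream.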
Why this matters in the long run:
- LLMs are a bit of a black box and can’t always be relied upon to work consistently. Until the models are >95% trustworthy, human oversight will be needed at some level for any critical deployment of AI within applications
- so, teams will have to study what the APIs are returning on some sample of real user data to gain some insight into how their application actually performs
Risks:
- if the APIs get much cheaper and more reliable, monitoring of those dimensions decreases in importance. many apps don’t really monitor all of their API dependencies until they become a large enough problem
- tracking all API calls is a fairly straightforward task, so some companies will find it easy to do themselves, and there will be competitors.
Tim note: I talked to a YC company that's doing this: Helicone
🤑 Cost Management for LLMs (kubecost for LLMs)
Problem: today’s LLMs are pretty expensive, and there are many opportunities to reduce cost by reducing tokens, caching intelligently, and using different models, but practitioners don’t always know the best tricks, and applying them is time consuming.
Solution:
- help users understand their costs, and opportunities to decrease cost
- token optimization - re-write prompts to shrink the number of input tokens. Help users understand the ideal size of output tokens.
- model optimization - test prompts across a variety of models. Suggest the optimal model for the given task.
- intelligent caching - use “sentence similarity” to cluster completions that are likely to produce the same output. For example “what’s the capital of ny” and “new york capital” are functionally equivalent.
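To make the intelligent-caching idea concrete, here's a minimal sketch under some assumptions: you supply your own embed_fn (an embeddings API or local sentence-similarity model) and generate_fn, and the similarity threshold is a made-up number you'd tune on real traffic.

```python
# Rough sketch of semantic caching: reuse a completion when a new prompt embeds
# "close enough" to one we've already answered. embed_fn, generate_fn, and the
# threshold are assumptions about your stack, not a specific product.
import numpy as np

CACHE = []  # list of (embedding, completion) pairs; a real system would use a vector store
SIMILARITY_THRESHOLD = 0.92  # illustrative; tune on real traffic

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cached_completion(prompt, embed_fn, generate_fn):
    """Return a cached completion for a semantically similar prompt, otherwise call the model."""
    query = np.asarray(embed_fn(prompt))
    for emb, completion in CACHE:
        if cosine(query, emb) >= SIMILARITY_THRESHOLD:
            return completion  # cache hit: "what's the capital of ny" ~ "new york capital"
    completion = generate_fn(prompt)
    CACHE.append((query, completion))
    return completion
```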
Why this matters in the long run:
- as long as running these models is computationally expensive, cost optimization is relevant
- OpenAI and the like aren’t incentivized to help users lower their spend on their platform
Risks:
- if a good free model ever comes along, this won’t be as valuable
- saving money may not be enough to support a monthly subscription business unless the product keeps delivering savings every month
🆓 Free ad-powered LLMs
Problem: people love free stuff. Charging per AI response won’t work for many consumer applications.
Solution:
- offer a free API that sits on top of leading LLMs
- insert ads in responses (either with an explicit ad unit or subtly within the conversation)
- focus on use cases that are high intent / high value. rewrite Amazon links.
- start with existing ad network (e.g. Google AdSense), work on building independent ad inventory focused on what people chat about
- we can also provide “Ad Network for LLM-powered services” for services that want to run their own LLM but also want ads.
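As one toy sketch of the explicit-ad-unit variant (the inventory, keyword matching, and URLs below are entirely made up), the simplest possible version is matching the conversation against a small ad table and appending a clearly labeled sponsored link:

```python
# Toy sketch of inserting an explicit ad unit after an LLM response.
# The inventory, matching logic, and URLs are illustrative placeholders.
AD_INVENTORY = {
    "running shoes": "Sponsored: trail running shoes on sale - https://example.com/shoes",
    "tax software": "Sponsored: file your taxes online - https://example.com/taxes",
}

def add_ad_unit(user_message, completion):
    """Append at most one relevant, clearly labeled ad to the model's response."""
    text = (user_message + " " + completion).lower()
    for keyword, ad in AD_INVENTORY.items():
        if keyword in text:
            return completion + "\n\n" + ad
    return completion  # no relevant inventory, serve the response ad-free
```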
Why this matters in the long run:
- the costs of OpenAI make it prohibitive for a large category of applications
- opening the door to high-quality free LLMs also enables personal & hobbyist applications
Risks:
- Google and Microsoft can run a free LLM service at a loss for a long time to build developer marketshare and mindshare
🚜 MLOps for LLMs
Problem: trying to do traditional machine-learning tasks with LLMs is much easier in some respects, but ML teams will still want the best practices from traditional MLOps (model lineage, versioning, deployment, monitoring, and retraining) for systems built on top of open-source LLMs like Flan and GPT-neo.
Solution:
- offer a set of open-source tools for working with LLMs through the command line and github
- offer a hosted service for spinning up a pipeline with one command
- the initial pieces should include fine-tuning, validation, versioning, and deployment
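One slice of this, sketched under assumptions (a local JSON registry, hypothetical helper names, and whatever fine-tuning/eval backend you already have): recording a version manifest for each fine-tune run so it can be validated, compared, and rolled back later.

```python
# Rough sketch of the versioning piece: log each fine-tune run (base model,
# training-data hash, resulting model id, eval score) to a local JSON registry.
# How the fine-tune and evaluation actually run is left to your backend.
import hashlib
import json
import time
from pathlib import Path

def file_hash(path):
    """Short content hash so a run is tied to the exact training data used."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()[:12]

def record_run(base_model, train_file, model_id, eval_score, registry="model_registry.json"):
    """Append one fine-tune run to the registry and return the manifest entry."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "base_model": base_model,
        "train_data_sha": file_hash(train_file),
        "model_id": model_id,
        "eval_score": eval_score,
    }
    path = Path(registry)
    runs = json.loads(path.read_text()) if path.exists() else []
    runs.append(entry)
    path.write_text(json.dumps(runs, indent=2))
    return entry
```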
Why this matters in the long run:
- I have a feeling that a lot of workflows that are currently custom ML models will migrate to LLMs, since it’s a lot easier to build on a foundation model than to train a new model from scratch for many tasks
- all of the tooling around production systems ends up getting managed in a similar way with the DevOps philosophy, so it seems inevitable that LLM workflows will be the same
- adding a familiar, high-collaboration workflow will enable better collaboration across ML and engineering teams to build better products
- no matter whether LLMs are self-hosted or managed, the tooling around them (fine-tuning, versioning, and validation) is unique to each company and needs to be managed in-house
Risks:
- fine-tuning and custom models turn out to be not that important to many companies
🤔 Prompt/Model Evaluation for LLMs (validation tests for LLMs)
Problem: Making changes to a prompt can have unintended consequences that need to be tested across many different examples, but manually testing things in the OpenAI playground is a very slow and error-prone way to iterate. A systematic way to test a large number of combinations would give developers the confidence to experiment with many models and prompts.
Solution:
- a matrix-based testing tool where users can set up models, prompts, and data to test
- a validation tool that lets users validate results manually, with AI scoring, or via MTurk
- a proprietary AI model that evaluates the quality of generated output based on the task
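For the matrix-based testing piece, a minimal sketch might look like the following, where call_model is a hypothetical stand-in for however you call your providers and the blank score column gets filled in later by reviewers, MTurk, or an AI grader:

```python
# Rough sketch of matrix testing: run every (model, prompt template, test case)
# combination and dump outputs to a CSV for manual, MTurk, or AI scoring.
# call_model is a hypothetical stand-in for your provider client.
import csv
import itertools

def run_matrix(models, prompt_templates, test_cases, call_model, out_path="eval_matrix.csv"):
    """Write one row per (model, template, case) combination so each output can be scored later."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "template", "case", "output", "score"])
        for model, template, case in itertools.product(models, prompt_templates, test_cases):
            prompt = template.format(**case)  # test cases are dicts of template variables
            output = call_model(model=model, prompt=prompt)
            writer.writerow([model, template, str(case), output, ""])  # score filled in later
    return out_path
```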
Why this matters in the long run:
- letting teams collaborate and experiment on prompts is an important way to “engineer” LLMs to produce predictably good results, which is really important for building applications
- when users report bugs and issues with generation, all of those cases need to be baked back into the evaluation suite
- the number of LLMs will also likely explode, meaning the evaluation effort of picking the right latency/cost/quality tradeoff will increase
Risks:
- this problem seems fairly obvious, so I imagine many competitors will appear
- it’s unclear how often companies will need to do this. it’s also unclear how painful the problem is.
- it’s very similar to the MLOps problem above, and CLI tools may be the preferred way for technical teams to solve this class of problems.
Tim note: Scale has Spellbook which is basically this
I'm not working on any of these - I'll write about what I'm working on next time.
Goals
Time for an update on Q1 goals:
Startup:
- Talk to 50 prospective customers
- 39 so far. Looks like I'm on track, though as the product has been changing, my target customer has been changing too, so unfortunately it's not like I've been talking to 50 of the same type of person.
- Get 3 design partners / pilot customers
- Working on this now
Personal:
- Work out 20 times per month (~5x a week)
- On track for 22 times in Feb.