I appreciate this! Thanks Swaroop!
For those looking for lightweight orchestration that works with your libraries of choice: https://github.com/dagworks-inc/burr is great for more agentic workflows (cycles, conditional branching), while https://github.com/dagworks-inc/hamilton is great for pipeline-style (DAG) workflows.
What a great article. Thank you, Chip!
Excellent article! Much of its content is applicable to various model serving scenarios, not just generative models.
One of the finest articles on RAG deployments!
Thanks Chip, for sharing valuable and important information about RAG.
First, thank you for all that you do. I own the ML Systems book and am reading the AI Engineering book on O'Reilly right now. One thing that I hope you address in your book or a future blog post is when to roll your own vs. using abstractions. Many components in this ecosystem are, how to put it, brittle and ever-changing. I hear about people going without certain abstraction frameworks and doing fine. I have yet to see this addressed anywhere in depth.
I have read your blog post and found it extremely interesting.
I would like to translate this article into Japanese and share it on Zenn, an engineering community platform. In the translation, I will clearly credit the original source, respect your copyright, and include a link to the original article.
I am writing to ask for your permission to translate and introduce this article. I would greatly appreciate your consideration of this request.
I look forward to hearing from you.
Go for it!
Awesome article! Though I'd change the title to "Building A Generative AI System".
Chip, really useful explanation here. There are good parallels for firms that have done this level of detailed work for ML infrastructure in the past and seen how swiftly it becomes complex.
Thanks for your generosity here, it's much appreciated.
Justin
Very comprehensive post! Thank you for taking the time to write it.
Would you mind sharing some strategies you recommend for optimizing latency without compromising model complexity in real-time applications?
This is amazing! Thank you for sharing.
Great work, looking forward to your new book!
If X is less than the similarity threshold you set, the cached query is considered the same as the current query, and the cached results are returned. If not, process this current query and cache it together with its embedding and results.
^ Should this be "If X is more than the similarity threshold you set,"?
Good catch. Fixed. Thanks!
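For anyone implementing this, here's a minimal sketch of the corrected check in Python. Everything here is illustrative, not from the post: `embed` and `generate` stand in for your embedding model and LLM call, and the 0.9 threshold is arbitrary.

```python
import numpy as np

# Illustrative threshold; tune it for your embedding model and traffic.
SIMILARITY_THRESHOLD = 0.9

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query: str, cache: list[dict], embed, generate) -> str:
    """Return a cached result if a semantically similar query exists;
    otherwise generate, cache, and return a fresh result."""
    q_emb = embed(query)
    for entry in cache:
        # X = similarity between the current query and a cached query.
        x = cosine_similarity(q_emb, entry["embedding"])
        if x >= SIMILARITY_THRESHOLD:  # MORE similar than the threshold -> same query
            return entry["result"]
    result = generate(query)
    cache.append({"embedding": q_emb, "result": result})
    return result
```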
Another failure management method could be to send the original failure output as well as info about the type of failure back to the model to provide more context and hopefully get a correct response.
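Something like this rough Python sketch, where `call_model` and `validate` are placeholders for your own inference call and output checker:

```python
def generate_with_feedback(prompt: str, call_model, validate, max_retries: int = 2) -> str:
    """Retry generation, feeding the failed output and the failure type
    back to the model for added context."""
    output = call_model(prompt)
    for _ in range(max_retries):
        ok, failure_type = validate(output)
        if ok:
            return output
        retry_prompt = (
            f"{prompt}\n\n"
            f"Your previous response failed validation.\n"
            f"Failure type: {failure_type}\n"
            f"Previous response: {output}\n"
            f"Please produce a corrected response."
        )
        output = call_model(retry_prompt)
    return output  # last attempt, even if still failing validation
```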
Thanks for the material!
A notable absence I see in this type of material is practical tips on using and configuring inference engines to run model inference.
It's not covered here because the post is already long, and many people might use inference services provided by model API companies. But inference servers will be covered in my book AI Engineering!
Where is the arrow pointing TO the database?
Thanks for the insights. One remark:
MLflow AI Gateway is deprecated and has been replaced by the Deployments API for generative AI. See the MLflow AI Gateway Migration Guide for migration.
You might want to fix the link.
Thank you for the great overview Chip!
I'd also like to mention some other pieces that I've been thinking about on top of the Gateway + guardrails architecture:
- Evals: A standard evals tool can help teams create their own eval suite easily, e.g. the PromptFoo library on top of the Portkey gateway.
- Logs UI: Slicing and dicing the production logs (including viewing images, audio, etc.) to curate datasets, for both evals & fine-tuning
- Fine-tuning: Gateway can also include fine-tuning abstraction APIs (e.g. using the same API spec as the OpenAI fine-tuning APIs)
- Monitoring & Alerting: Observability on metrics & cost (you've already mentioned this)
- RAG: Gateway can also include RAG abstraction APIs, providing flexibility to switch vendors easily, for accuracy or cost trade-offs
- Batch generation APIs: Gateway can also include batch inference abstraction APIs (e.g. using the same API spec as the OpenAI batch inference APIs)
- Routing to cheapest model: Gateway can include a router that inspects prompt complexity and routes to the cheapest model that can answer, e.g. RouteLLM or Martian; a rough sketch of the idea follows below.
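A toy illustration of that last point, with a hard-coded heuristic standing in for the learned routers that tools like RouteLLM provide. The model names, the heuristic, and `call_model` are all hypothetical:

```python
CHEAP_MODEL = "small-model"
STRONG_MODEL = "large-model"

def estimate_complexity(prompt: str) -> float:
    """Crude stand-in heuristic: longer prompts and reasoning-heavy
    keywords score higher. Real routers learn this decision instead."""
    score = min(len(prompt) / 2000, 1.0)
    if any(w in prompt.lower() for w in ("prove", "derive", "step by step")):
        score += 0.5
    return score

def route(prompt: str, call_model):
    # Route simple prompts to the cheap model, hard ones to the strong model.
    model = CHEAP_MODEL if estimate_complexity(prompt) < 0.5 else STRONG_MODEL
    return call_model(model=model, prompt=prompt)
```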