Swaroop Chitlur • 2 months ago

Thank you for the great overview Chip!

Would like to also mention some other pieces that I’ve been thinking about on top of the Gateway + guardrails architecture:

- Evals: A standard evals tool can help teams create their own eval suite easily, e.g. PromptFoo library on top of Portkey gateway.
- Logs UI: Slicing and dicing the production logs (including viewing images, audio, etc.) to curate datasets, for both evals & fine-tuning
- Fine-tuning: Gateway can also include fine-tuning abstraction APIs (e.g. using same API spec as the OpenAI fine-tuning APIs)
- Monitoring & Alerting: Observability on metrics & cost (you've already mentioned this)
- RAG: Gateway can also include RAG abstraction APIs, providing flexibility to switch vendors easily, for accuracy or cost trade-offs
- Batch generation APIs: Gateway can also include batch inference abstraction APIs (e.g. using same API spec as the OpenAI batch inference APIs)
- Routing to cheapest model: Gateway can include a router that inspects prompt complexity and routes to the cheapest model that can answer, e.g., RouteLLM, Martian.
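A rough sketch of what that last idea (cost-aware routing) could look like. The model names, prices, and the complexity heuristic below are all made up for illustration; real routers like RouteLLM learn this from preference data rather than using keyword rules:

```python
def estimate_complexity(prompt: str) -> float:
    """Crude proxy for prompt complexity: longer prompts and
    reasoning keywords push the score up (illustrative only)."""
    score = min(len(prompt) / 2000, 1.0)
    if any(k in prompt.lower() for k in ("prove", "step by step", "analyze")):
        score += 0.5
    return min(score, 1.0)

# (name, cost per 1K tokens, capability ceiling) -- illustrative numbers
MODELS = [
    ("small-model", 0.0005, 0.3),
    ("medium-model", 0.003, 0.7),
    ("large-model", 0.03, 1.0),
]

def route(prompt: str) -> str:
    """Pick the cheapest model whose capability ceiling covers
    the estimated prompt complexity."""
    need = estimate_complexity(prompt)
    for name, _cost, ceiling in sorted(MODELS, key=lambda m: m[1]):
        if ceiling >= need:
            return name
    return MODELS[-1][0]  # fall back to the most capable model
```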

Chip • 1 month ago

I appreciate this! Thanks Swaroop!

Stefan Krawczyk • 1 month ago

For those looking for lightweight orchestration that works with your libraries of choice: https://github.com/dagworks-inc/burr is great for more agentic workflows (cycles, conditional branching), while https://github.com/dagworks-inc/hamilton is great for more pipeline (DAG) workflows.

Justin Shiu • 2 months ago

What a great article. Thank you, Chip!

Manoj Agarwal • 2 months ago

Excellent article! Much of its content is applicable to various model serving scenarios, not just generative models.

Karan Shingde • 2 months ago

One of the finest articles on RAG deployments!

Thanks Chip, for sharing valuable and important information about RAG.

Charles Frenzel • 4 weeks ago

First, thank you for all that you do. I own the ML Sys book and am reading the AI Engineering book on O'Reilly right now. One thing that I hope you address in your book or a future blog post is when to roll your own vs. using abstractions. It seems like many components in this ecosystem are, how to put it, brittle and ever-changing? I hear of people going without certain abstraction frameworks and doing fine. I have yet to see this addressed anywhere in depth.

yudai yamamoto • 4 weeks ago

I have read your blog post and found it extremely interesting.

I would like to translate the content of this article into Japanese and share it on Zenn, an engineering community platform. In the process of translation and introduction, I will clearly state the original source and respect your copyright. I will also include a link to the original article in the translated version.
I am writing to ask for your permission to translate and introduce this article. I would greatly appreciate your consideration of this request.

I look forward to hearing from you.

Chip • 4 weeks ago

Go for it!

Bruno Scaglione • 1 month ago

Awesome article! Though I'd change the title to "Building A Generative AI System".

Justin Townsend • 1 month ago

Chip, a really useful explanation here, with good parallels for firms that may have experience doing this level of detailed work for ML infrastructure in the past, and have seen how swiftly it becomes complex.

Thanks for your generosity here, it's much appreciated.

Justin

Mohamed El-Refaey • 1 month ago

Very comprehensive post! Thank you for taking the time to write it.
Would you mind sharing some strategies you recommend for optimizing latency without compromising model complexity in real-time applications?

Scott Havird • 1 month ago

This is amazing! Thank you for sharing.

Bo Liu • 1 month ago

Great work, looking forward to your new book!

Sajal Sharma • 1 month ago

If X is less than the similarity threshold you set, the cached query is considered the same as the current query, and the cached results are returned. If not, process this current query and cache it together with its embedding and results.

^ Should this be "If X is more than the similarity threshold you set,"?
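For concreteness, the corrected logic (similarity above the threshold → cache hit) might look like the sketch below, using cosine similarity. `embed` and `process_query` are hypothetical stand-ins for the embedding model and the full query pipeline, and the threshold value is illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

cache = []       # list of (embedding, result) pairs
THRESHOLD = 0.9  # illustrative similarity threshold

def answer(query, embed, process_query):
    q_emb = embed(query)
    # If similarity exceeds the threshold, treat the cached query as
    # the same as the current one and return the cached result.
    for emb, result in cache:
        if cosine(q_emb, emb) > THRESHOLD:
            return result
    # Otherwise process the query and cache it with its embedding.
    result = process_query(query)
    cache.append((q_emb, result))
    return result
```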

Chip • 1 month ago

good catch. fixed. thanks!

Sajal Sharma • 1 month ago

Another failure management method could be to send the original failure output as well as info about the type of failure back to the model to provide more context and hopefully get a correct response.
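A minimal sketch of that retry-with-feedback idea. `call_model` and `validate` are hypothetical stand-ins for the model call and whatever failure check the application runs (schema validation, guardrails, etc.):

```python
def generate_with_feedback(prompt, call_model, validate, max_retries=2):
    """Retry generation, feeding the failed output and the failure
    type back to the model for extra context."""
    output = call_model(prompt)
    for _ in range(max_retries):
        error = validate(output)  # None if the output is acceptable
        if error is None:
            return output
        # Retry with the original prompt plus the failed output and
        # a description of what went wrong.
        retry_prompt = (
            f"{prompt}\n\nYour previous answer was:\n{output}\n"
            f"It failed validation because: {error}\nPlease try again."
        )
        output = call_model(retry_prompt)
    return output  # last attempt, returned even if still failing
```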

arisco • 1 month ago

Thanks for the material!

A notable absence that I see in this type of material is practical tips on using and configuring inference engines to run model inference.

Chip • 1 month ago

it's not covered here because the post is already long and many people might use inference services provided by model API companies. but inference servers will be covered in my book AI Engineering!

Mark Kerzner • 2 months ago

Where is the arrow pointing TO the database?

Ken Jiiii • 2 months ago

Thanks for the insights. One remark:

MLflow AI Gateway is deprecated and has been replaced by the Deployments API for generative AI. See the MLflow AI Gateway Migration Guide for migration.

You might want to fix the link.