BentoML's unique approach to AI inference 🎛

Plus: Founder & CEO Chaoyu on AI infra and scaling inference...

CV Deep Dive

Today, we’re talking with Chaoyu Yang, Founder and CEO of BentoML. 

BentoML is an AI inference platform for building fast, secure and scalable AI applications, abstracting away the infrastructure needed for running and scaling AI models efficiently. Chaoyu founded the company in 2019 after a stint as an early engineer at Databricks. The startup's tagline is “Run any AI model in your cloud”, and it targets enterprise AI teams who build applications on top of proprietary and open source models.

Today, BentoML has thousands of AI/ML teams using its product for serving AI/ML models in production across enterprises and leading AI startups. The team also maintains a growing open source community of over 4000 developers worldwide. In early 2023, the startup announced a $9m Seed round led by DCM Ventures, with Bow Capital also participating. 

In this discussion, Chaoyu shares his thoughts on AI infrastructure, Bento’s approach to scaling inference, and how his time at Databricks has shaped him as a leader. 

Let’s dive in!

Read time: 8 mins

Our Chat with Chaoyu 💬

Chaoyu - welcome to Cerebral Valley! Firstly, give us a bit about your background, and what led you to start BentoML.

I joined Databricks in 2014 and had the opportunity to work with many early AI/ML adopters among enterprise organizations. The key takeaway for me from that experience was learning about the many challenges and the complexity of building the infrastructure required for running and scaling AI/ML workloads efficiently. This led to the founding of BentoML.

We believe AI is critical for every organization to compete and win in the future. BentoML wants to power the next generation of enterprise AI applications with a multi-cloud, open source, security-first AI inference platform, helping enterprise teams launch AI-powered products faster while keeping their most valuable data and models securely in their own environment.

How would you describe Bento’s core value proposition, to an AI/ML developer or team that isn’t familiar with you? 

BentoML is best suited for AI/ML developers who are building on top of open source or proprietary models. Whether you're working with open source models from HuggingFace, fine-tuned LLMs, or custom models with custom inference code, BentoML's platform makes it easy to serve any model in production by providing a serving stack optimized for speed, cost efficiency, and ease-of-use. We help run the infrastructure, taking care of reliability, scalability, observability and performance optimizations for AI inference, so you get to own your models and build custom AI applications on top with our open source serving frameworks.
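To make that concrete, here is a minimal sketch of what a service can look like with the open-source framework's Python API (1.2-style). The summarization model, service name, and resource hints are illustrative, not taken from the interview:

```python
import bentoml

# A minimal sketch of a BentoML service (1.2-style API). The model choice,
# resource hints, and timeout are illustrative.
@bentoml.service(
    resources={"cpu": "2"},
    traffic={"timeout": 30},
)
class Summarizer:
    def __init__(self) -> None:
        # Load the model once per worker process at startup.
        from transformers import pipeline
        self.pipeline = pipeline("summarization")

    @bentoml.api
    def summarize(self, text: str) -> str:
        # One forward pass per request; BentoML handles the HTTP layer,
        # scaling, and observability around this function.
        return self.pipeline(text)[0]["summary_text"]
```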

A little background here if you are only familiar with proprietary models provided through APIs, such as OpenAI's GPT-4. The reasons many teams choose to self-host AI models typically come down to:

1. More control over cost and latency, e.g. choosing a smaller model specialized for a given task, or composing multiple models to accomplish a more complex task.

2. Working with custom models trained or fine-tuned on proprietary data.

3. Operating in a highly regulated industry or with sensitive data, where you may need to restrict who can access the data going in and out of your AI models, or keep your model files in a secured environment.

Why is it important to abstract away the infrastructure layer? Why have you chosen to focus on this core problem as the one that Bento is attacking first?

Every organization will need AI to win and compete in the future. Over the last year, we saw countless promising demos and prototypes powered by open source foundation models. While prototypes show potential, they are not meant to serve mission-critical workloads or operate at large scale. As more enterprises aim to integrate AI into their products and understand the ROI of AI this year, the underlying infrastructure required to support these AI applications becomes a major obstacle.

Building the infrastructure layer for AI inference is a very hard problem, which we're solving by providing a fast, secure, cost-efficient platform. If AI teams can easily ship and scale their AI products in production, that unblocks massive value, and thus a great opportunity for us as a startup. That's why we chose to focus on this problem and built the best team to attack it.

How do you see Bento’s core product evolving in the next 6-12 months? What should your customers be excited about?

Developer experience and infrastructure efficiency are the two primary areas we are going to continue investing in.

Better developer experience means faster iteration from development to production, shipping new AI features faster, and more flexibility for customization. This is often the #1 reason why a developer chooses BentoML.

Infrastructure efficiency is the key to achieving optimal performance and cost efficiency. We constantly innovate in this area and partner with industry leaders to incorporate the latest inference optimization techniques, making them easily available in our platform. At the same time, we give customers the flexibility to choose the right deployment strategy and understand the different trade-offs. For example, you can deploy throughput-optimized LLM batch inference workloads on the BentoML platform to achieve significantly lower cost per token at higher latency, as opposed to a typical low-latency online setting optimized for chatbot applications.
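As a hedged illustration of that trade-off, the open-source framework exposes adaptive batching on an endpoint. The model and parameter values below are hypothetical; the pattern is that a larger batch size and a longer batching window improve GPU utilization (and thus cost per token) at the price of higher per-request latency:

```python
import bentoml

# Hypothetical sketch of a throughput-optimized endpoint via adaptive
# batching. Values are illustrative.
@bentoml.service(resources={"gpu": 1})
class BatchClassifier:
    def __init__(self) -> None:
        from transformers import pipeline
        # device=0 assumes the GPU requested above is available.
        self.clf = pipeline("sentiment-analysis", device=0)

    # batchable=True lets BentoML group concurrent requests into a single
    # forward pass; max_latency_ms bounds how long a request may wait in
    # the batching queue before it must be served.
    @bentoml.api(batchable=True, max_batch_size=64, max_latency_ms=500)
    def classify(self, texts: list[str]) -> list[str]:
        return [result["label"] for result in self.clf(texts)]
```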

Lastly, we are also building out more enterprise features, from security, compliance, and access control to supporting more cloud platforms in our BYOC (bring-your-own-cloud) offering.

As a new AI or ML practitioner, how should I think about incorporating BentoML into my own stack as I’m experimenting and incorporating new models? 

For new AI/ML practitioners, definitely try out our open source framework for model serving. It makes it easy to build model inference services, LLM APIs, inference graphs, or compound AI systems. You can develop everything locally and smoothly transition to the cloud. There are a ton of pre-built project templates and examples to help you get started, and a thriving open source community behind the project to help you out when you run into issues.
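The local workflow typically looks something like this: start a service with the `bentoml serve` CLI, then call it from Python. The endpoint below refers to the hypothetical Summarizer sketched earlier:

```python
import bentoml

# Hypothetical local-development flow: with the Summarizer service from
# the earlier sketch running (e.g. started with `bentoml serve`), call
# its endpoint over HTTP using the built-in client.
client = bentoml.SyncHTTPClient("http://localhost:3000")
summary = client.summarize(text="BentoML is an AI inference platform ...")
print(summary)
client.close()
```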

If you are coming from a data science or AI research background, know that BentoML is designed to work nicely with your training and experimentation environment, helping you incorporate best practices into your deployment workflow, manage your model versions and environments, package deployable artifacts, and automatically bake in tracing and monitoring components so you can see how your models are performing after they are deployed to production.

If you are coming from GenAI application development with tools such as LangChain or LlamaIndex, BentoML can help turn your RAG pipeline or agentic workflow into a scalable backend service, with multiple components automatically orchestrated and ready to scale. For example, you can create a service whose components range from a real-time data ingestion pipeline and LLM inference to streaming API processing and programmatic tools for your LLMs, all managed in one distributed system.
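Here is a hedged sketch of that composition pattern. The services, models, and the omitted retrieval logic are hypothetical; the idea is one service declaring a dependency on another, with each component placed on its own hardware and scaled independently:

```python
import bentoml

# Hypothetical sketch of composing services: a GPU-bound embedder and a
# CPU-bound RAG frontend, wired together with a service dependency.
@bentoml.service(resources={"gpu": 1})
class Embedder:
    def __init__(self) -> None:
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    @bentoml.api
    def embed(self, texts: list[str]) -> list[list[float]]:
        return self.model.encode(texts).tolist()


@bentoml.service(resources={"cpu": "2"})
class RAGService:
    # Declare the dependency; BentoML handles the remote call and lets
    # each service scale independently.
    embedder = bentoml.depends(Embedder)

    @bentoml.api
    def query(self, question: str) -> str:
        vector = self.embedder.embed(texts=[question])[0]
        # Retrieval over a vector store and the final LLM call are
        # omitted in this sketch.
        return f"retrieved context using a {len(vector)}-dim embedding"
```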

Which use-case for BentoML excites you the most? 

I'm super excited about the growing number of AI applications that are built with multiple models and components instead of a single monolithic model. The most common example is retrieval-augmented generation (RAG): a production-grade RAG system involves dozens of AI models, ranging from LLMs, embeddings, and rerankers to layout analysis, visual reasoning, OCR, semantic chunking, summarization, and more depending on the use case. And you see the same pattern in multi-step chaining strategies in LLM apps, image/video generation pipelines, and complex OCR, NLP, and computer vision tasks.

BentoML is built from the ground up for scaling such applications, allowing developers to dynamically compose and mix multiple model inference steps with custom code, serving heterogeneous mixes of CPU-, GPU-, and IO-bound workloads while efficiently utilizing compute resources across a cluster.

Talk to us about your perspective on open source. How valuable has this been to Bento's overall platform and the trajectory of your growth? 

We believe the future of AI will be open source: from the data used to train foundation models, to the training code itself, all the way to the software stack for efficiently running AI models on the edge or at scale in the cloud.

We created two widely used open source AI projects: BentoML, the unified model serving framework, and OpenLLM for self-hosting LLMs. Our team also includes creators of, or core contributors to, projects such as KubeVela, Rest.li, PDM, ArgoCD, OpenAI-translator, and many more.

As a company, we also benefit massively from open source. Not only does it give us the opportunity to partner with early adopters in the community and gain valuable product insights, but on the commercial side, it also gives us immediate brand recognition when we speak to potential customers. Today, almost 100% of our enterprise customers have come to us because they knew of our open source projects or had already been using them.

Lastly, how would you describe the culture at Bento? Are you hiring, and what do you look for in prospective team members?

Customer-obsessed - something I stole from Ali Ghodsi and Databricks. Whether we're deciding which feature to build, diving into the latest AI research, or optimizing LLM inference, we always have a bias toward doing it closely with our customers. This helps us uncover problems faster, understand customer needs better, and, at the end of the day, build a much better product.

As a team, we love open source and community building. If you happen to be in the San Francisco Bay Area, come join our monthly AGI Builders Meetup!

And we're hiring! If you're an SRE, Solution Architect, or AE interested in early stage startups and like what we are building, come join us!

Conclusion

To stay up to date on the latest from Bento, follow them on X and learn more at BentoML.

Read our past few Deep Dives below:

If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.