FriendliAI - The Research Behind vLLM

Plus: CEO Gon Chun on why open-source models are eliminating the "convenience tax," building a proprietary engine from scratch, and the $50K Switch campaign for teams ready to leave closed APIs behind...

CV Deep Dive

Today, we’re talking with Byung-Gon Chun (Gon), Founder and CEO of FriendliAI.

If you’ve used vLLM, you’ve used technology that traces back to Gon’s research. His lab at Seoul National University published the ORCA paper, which introduced Continuous Batching, a technique that fundamentally changed how LLM inference engines handle concurrent requests. ORCA directly inspired vLLM and has since become an industry standard across virtually every major serving framework. The team behind one of the most influential advances in modern inference infrastructure is now building the platform to make it all accessible.

FriendliAI is a world-class inference platform that helps teams deploy and scale large open-source and custom AI models efficiently, reliably, and at significantly lower cost. Built on a proprietary inference stack optimized across batching, quantization, scheduling, caching, and GPU kernels, Friendli positions itself as a premier alternative to closed model providers like OpenAI and Anthropic, offering comparable performance and convenience for the world’s best open-source models without the proprietary markup. The platform supports serverless APIs, dedicated endpoints, and container-based deployments including fully on-prem setups, giving teams the flexibility to choose the right deployment without being locked into a single approach.

Friendli was founded on the conviction that while training was getting all the attention, inference was quietly becoming the real bottleneck: cost, latency, and reliability challenges compound as AI moves from research demos into production systems serving millions of users. Today, the platform serves AI-native startups, enterprises deploying LLMs and multimodal models at scale, and model providers seeking broader adoption without the burden of operating massive inference infrastructure. Customers use Friendli for chat applications, code generation, agent systems, multimodal AI, and internal enterprise copilots, often achieving 50%+ cost savings by switching workloads from closed APIs to Friendli’s optimized open-source stack.

Notable partnerships include LG AI Research, where Friendli serves as the official inference partner for their EXAONE family of models, and Twelve Labs, whose industry-leading video understanding models rely on Friendli’s engine for compute-intensive video inference at scale. The company also recently launched a Switch campaign offering up to $50,000 in inference credits for teams ready to migrate off closed providers, a signal of their confidence in the economics.

In this conversation, Gon walks us through the founding story of FriendliAI, what sets their inference stack apart in an increasingly crowded landscape, and why he believes the era of the “convenience tax” is ending for teams ready to take control of their model infrastructure.

Let’s dive in ⚡️

Read time: 8 mins

Our Chat with Gon 💬

Gon, welcome to Cerebral Valley! Your research has shaped how most of the industry does inference today. Give us the backstory on you and Friendli.

I’m Byung-Gon Chun (people call me Gon), and I’m the Founder and CEO of FriendliAI. Before starting Friendli, I was a professor at Seoul National University working on large-scale machine learning systems, particularly around efficient inference and training of large AI models. A lot of our early work focused on how to make these cutting-edge models actually usable in production.

One of our most well-known contributions is ORCA and Continuous Batching. ORCA inspired vLLM and several other open-source frameworks, and Continuous Batching has become an industry standard for efficient LLM inference. That work gave us a front-row seat to just how painful the inference problem really was, even for well-resourced teams.
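For readers who haven’t worked inside a serving stack, a minimal sketch helps make continuous batching concrete. The simulation below is illustrative only, not FriendliAI’s or vLLM’s actual scheduler: the core idea is that requests join and leave the batch at token-iteration granularity, so short requests never sit behind the slowest member of a static batch.

```python
# Minimal, simplified sketch of continuous (iteration-level) batching,
# the scheduling idea introduced in ORCA. Illustrative simulation only,
# not FriendliAI's or vLLM's actual scheduler.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    remaining_tokens: int  # decode steps this request still needs

def continuous_batching(requests, max_batch_size=4):
    waiting = deque(requests)
    running = []
    step = 0
    while waiting or running:
        # Admit new requests as soon as a slot frees up (per iteration),
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One decode iteration for every active sequence.
        for req in running:
            req.remaining_tokens -= 1
        # Retire finished sequences immediately, freeing their slots.
        done = [r.rid for r in running if r.remaining_tokens == 0]
        running = [r for r in running if r.remaining_tokens > 0]
        step += 1
        if done:
            print(f"step {step}: finished requests {done}")

if __name__ == "__main__":
    # Mix of short and long requests: short ones finish and are replaced
    # without stalling behind the long ones.
    continuous_batching([Request(i, n) for i, n in enumerate([3, 12, 5, 9, 2, 7])])
```

Production engines layer KV-cache management, memory paging, and custom GPU kernels on top of this, but the scheduling principle is the same.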

Friendli really grew out of that frustration with production inference. Teams were spending enormous amounts of time and money to serve these models fast and reliably. When we started the company, training was getting all the attention, but I knew that inference was becoming the real bottleneck. We started Friendli to solve that problem end-to-end and to make high-performance inference accessible without requiring every company to build its own infrastructure from scratch.

It wasn’t a single ‘eureka’ moment, but rather a wave of realization. When we published the ORCA paper at OSDI, we knew we had solved a fundamental inefficiency in how GPUs process requests. But seeing the open-source community, specifically projects like vLLM, adopt these ideas so rapidly was the real validation. It was surreal to watch a technique we developed in the lab become the default standard for the entire industry almost overnight.

That transition from academic paper to industry standard was actually the spark for Friendli. We realized that while the concept of continuous batching was out there, implementing it reliably at enterprise scale, with the necessary scheduling, caching, and redundancy, was a completely different beast. I saw brilliant teams struggling to build this infrastructure in-house, and it became clear that the world didn’t need just another open-source library; they needed a robust, enterprise-grade platform that just worked.

You use the phrase “convenience tax” to describe what closed providers charge. Break that down for us.

Friendli is a world-class AI inference platform that helps teams deploy and scale large open-source and custom AI models efficiently, reliably, and at much lower cost. We see ourselves as a premier alternative to closed model providers like OpenAI and Anthropic. Those providers offer high-quality models, but they essentially charge a convenience tax through their proprietary APIs. Friendli provides the same level of convenience for the world’s best open-source models without that markup.

For developers, that means using state-of-the-art open-source models via a familiar API with low latency and high throughput. For AI teams, it means they don’t need to worry about GPU utilization, kernel optimization, or auto-scaling and operations. We handle that under the hood so they can focus on building their AI products, not managing inference infrastructure.

Who’s getting the most value from Friendli today? Paint us a picture of your typical customer.

Our core users fall into three groups: AI-native startups, enterprises deploying LLMs or multimodal models at scale, and model providers who want broader adoption without running massive inference operations. The teams seeing the most value are those operating at scale, where moving away from closed model ecosystems to optimized open-source inference directly improves product quality and their bottom line.

A great example is actually happening right now in the developer tools space. We’re seeing a huge surge in usage from thousands of individual developers and popular coding agents, often accessing us through platforms like OpenRouter, where we frequently win on speed and reliability.

Previously, many of these developers felt locked into expensive closed models like Claude for their coding tasks because they believed open-source alternatives couldn’t compete. But that gap has closed dramatically. Models like GLM, MiniMax, DeepSeek, and Kimi have gotten extremely good at code generation and reasoning.

We’ve seen developers switch their agent backends from Claude to running these high-performance open-source models on Friendli. The feedback is consistent: they are getting comparable coding proficiency, but at a price point that makes their unit economics actually work. For a coding agent that might make hundreds of inference calls to refactor a single codebase, that cost difference is the difference between a viable product and burning cash. They come for the savings, but they stay because the latency on our engine makes the ‘feels-like’ performance of their agents significantly snappier.

Talk to us about some existing use cases. What kinds of workloads are people running on Friendli?

Friendli is used for chat applications, code generation, agent systems, multimodal AI, and internal enterprise copilots. Many customers come to us after hitting scaling limits and experiencing challenges such as unpredictable latency, inefficient GPU usage, or rising costs. By switching from closed APIs to Friendli-powered open-source models like Qwen, DeepSeek, Kimi, GLM, or Llama, they often see significant latency improvements and much better cost efficiency without changing their application logic.

We’re also working with several leading coding agents, helping them deploy and scale both open-source and custom fine-tuned models. These agents require low latency and high throughput to provide a seamless user experience, and our platform allows them to achieve that while maintaining total control over their model stack.

Another exciting partnership is with Twelve Labs, who use Friendli to run inference for their industry-leading video understanding models. Video inference is incredibly compute-intensive, and our engine ensures they can deliver fast, accurate insights from video content at scale. It’s a great example of how our stack handles workloads that go well beyond standard text-based LLM serving.

For a team that’s curious but hasn’t tried Friendli yet, how fast can they get something running?

It’s incredibly easy. Most customers get started by selecting a model and redirecting their existing OpenAI-compatible API calls to Friendli. It’s essentially a drop-in replacement with the same API format and dramatically better economics.
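As a sense of what that swap can look like in practice, here is a minimal sketch using the official OpenAI Python client. The base URL, model ID, and token below are illustrative placeholders rather than confirmed values; check Friendli’s documentation for the exact endpoint and available model names.

```python
# Sketch of the drop-in replacement pattern with the official OpenAI
# Python client. The base_url, api_key, and model name are placeholders,
# not confirmed values; consult FriendliAI's docs for the real ones.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_FRIENDLI_TOKEN",                      # placeholder credential
)

response = client.chat.completions.create(
    model="deepseek-r1",  # hypothetical open-source model ID on Friendli
    messages=[{"role": "user", "content": "Summarize continuous batching in one sentence."}],
)
print(response.choices[0].message.content)
```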

One of our most powerful features is our one-click deployment for over 500,000 open-source models from Hugging Face. Whether you need serverless APIs, dedicated endpoints, or container capabilities for on-prem needs, you can move into production within minutes. We recommend starting with latency-sensitive workloads, high-throughput inference, or long-context reasoning use cases to see the platform’s strengths firsthand.

Tell us about the LG AI Research partnership. That’s a unique position for an inference company to be in.

Our partnership with LG AI Research is a great example of how we empower model providers to reach developers at scale. LG is building incredible models like K-EXAONE, a massive 236-billion-parameter hyper-attention mixture-of-experts model. Friendli serves as the official inference partner, providing both serverless and dedicated endpoints for the EXAONE model family.

This allows developers to access state-of-the-art Korean and English language capabilities with the performance and reliability of the Friendli inference stack. By handling the complex infrastructure required for a model of that scale, we are helping LG reach a much broader audience of developers who can now integrate EXAONE into their applications with just a few lines of code. For model providers, that’s the value proposition in a nutshell: we take the serving burden off their plate so they can focus on model development and community.

How are you measuring impact for your customers? What does success look like in hard numbers?

We focus on latency, throughput, reliability, and crucially, cost per token. By moving workloads from closed models to our optimized open-source stack, our customers often achieve cost savings of 50% or more. That’s not a marginal improvement. For teams running inference at scale, it fundamentally changes their unit economics.

Ultimately, success means customers can scale their AI products and usage without scaling their operational complexity or budget at the same rate. That’s the bar we hold ourselves to.

You recently launched a Switch campaign with some serious incentives. What’s behind that, and who should be paying attention?

We just launched the Switch campaign specifically because we know the main barrier to moving off closed APIs isn’t just technical; it’s the risk of transition. Teams have built their applications around a specific provider’s API, and even when the economics don’t make sense anymore, the switching cost feels high.

To help with that, we’re offering up to $50,000 in inference credits for eligible teams that switch their workloads to Friendli. It’s designed to give companies the runway they need to optimize their implementation and see the 50%+ cost savings for themselves, without any upfront financial risk. If you’re hitting scale and the convenience tax is starting to hurt, this is the time to make the move.

The inference space is getting crowded. What’s your honest take on what makes Friendli different from the other players?

Friendli is built on a proprietary inference stack, highly optimized across batching, quantization, scheduling, caching, and GPU kernels. Our architecture is flexible and decoupled, supporting serverless and dedicated endpoints as well as container-based deployments, including fully on-prem setups, without forcing a single deployment model. That matters because different customers have very different requirements around data residency, security, and infrastructure preferences.

This flexibility, combined with our world-class performance on both custom and open-source models, makes us a true infrastructure partner, not a model provider. We’re not asking you to adopt our model ecosystem; we’re making the models you choose run better, faster, and cheaper.

The biggest misconception is fixating solely on the ‘sticker price’ per million tokens. Teams often look at a pricing page and choose the cheapest option, assuming all tokens are created equal. They aren’t.

What matters in production is performance stability. You might find a provider that is cheaper on paper, but if their latency spikes during your peak traffic hours, or if their throughput degrades when you scale, you lose users. The ‘convenience tax’ isn’t just about money; it’s about the hidden cost of unreliable infrastructure. We tell teams to measure ‘performance per dollar’ under load, not just the price of a token in a vacuum. That’s usually where the decision becomes obvious.
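One way to act on that advice is to benchmark an endpoint under concurrent load instead of trusting the pricing page. The sketch below is a rough, provider-agnostic example that works against any OpenAI-compatible endpoint; the endpoint, model ID, concurrency, and per-token price are placeholders, not Friendli’s actual numbers.

```python
# Rough sketch of measuring "performance per dollar" under concurrent load
# against any OpenAI-compatible endpoint. All constants below are
# illustrative placeholders, not real provider values.
import asyncio
import time

from openai import AsyncOpenAI

BASE_URL = "https://api.example-provider.com/v1"  # placeholder endpoint
MODEL = "example-open-source-model"               # placeholder model ID
PRICE_PER_MTOKEN = 0.50                           # placeholder $ per 1M output tokens
CONCURRENCY = 32
PROMPT = "Write a short docstring for a function that merges two sorted lists."

async def one_request(client):
    start = time.perf_counter()
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=256,
    )
    return time.perf_counter() - start, resp.usage.completion_tokens

async def main():
    client = AsyncOpenAI(base_url=BASE_URL, api_key="YOUR_TOKEN")
    t0 = time.perf_counter()
    results = await asyncio.gather(*[one_request(client) for _ in range(CONCURRENCY)])
    wall = time.perf_counter() - t0
    latencies = sorted(r[0] for r in results)
    total_tokens = sum(r[1] for r in results)
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    cost = total_tokens / 1_000_000 * PRICE_PER_MTOKEN
    print(f"p95 latency: {p95:.2f}s | throughput: {total_tokens / wall:.0f} output tok/s")
    print(f"output tokens per dollar: {total_tokens / cost:,.0f}")

if __name__ == "__main__":
    asyncio.run(main())
```

Run it at the concurrency and prompt lengths that match your real traffic; the provider ranking that looks obvious on a pricing page often changes once you do.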

Take us under the hood. Why build your own engine from scratch instead of building on top of existing open-source frameworks?

Friendli vertically optimizes AI model execution and infrastructure management end-to-end for inference. We started with our own engine from the ground up, highly optimized for fast inference to avoid framework overhead, and built in extensive redundancy for reliable operation. We didn’t want to layer on top of existing open-source serving frameworks; what we wanted was full control over every layer of the stack so we could push performance beyond what’s possible with off-the-shelf solutions.

This enables consistent performance across all deployment modes. The system adapts in real time to dynamic inference workflows, ensuring that even under heavy load, the user experience remains fast and responsive. Whether you’re running a single model or orchestrating across multiple, the engine handles the complexity under the hood.

What’s been the hardest engineering problem you’ve had to solve?

The hardest challenge has been optimizing latency, throughput, and cost simultaneously across diverse real-world models and workflows. It’s common to optimize for one dimension at the expense of another. You can get low latency if you throw enough compute at it, or you can get great throughput if you’re willing to batch aggressively and accept higher latency. Doing all three at once, across a huge variety of model architectures, input patterns, and scale requirements, is where the real engineering difficulty lives.

Achieving this required deep work across system runtimes, GPU kernels, and large-scale distributed infrastructure, with constant validation in production. The research background of our team has been critical here. The same rigor that went into ORCA and Continuous Batching carries through to everything we build on the platform side.

What should people expect from Friendli over the next 6-12 months? Where are you headed?

We’re focused on unlocking higher-speed inference on GPUs beyond the limits typically seen with other service providers, while expanding high-speed inference for multimodal models and strengthening our self-serve platform. We want to make it even easier for teams to escape the closed model trap by providing the fastest, most reliable way to run any custom or open-source model.

We are also seeing a massive shift toward agentic workloads. When you move from simple chatbots to agentic systems and reasoning models, the inference profile changes completely. An agent might need to ‘think’ for several steps, call tools, and iterate before answering the user.

In a chat app, a 200-millisecond delay is annoying. In an agent loop that runs ten internal steps to solve one problem, that delay compounds into seconds of dead air. Low latency becomes non-negotiable. We are seeing customers specifically asking for optimizations around low response latency, fast token generation, and efficient context caching because their agents are reading massive amounts of data and iterating on it in real-time. Friendli’s engine is uniquely suited for these agentic workloads because we optimize the entire execution path, not just the batching.

Lastly, tell us about the team. What’s the culture like, and what kind of people thrive at Friendli?

We’re a deeply technical, execution-driven team with strong roots in research and engineering. We’re actively hiring across the company and looking for top talent in both the San Francisco Bay Area and Korea. We value people who are genuinely enthusiastic about the field, take initiative, and enjoy collaborating with others. Technical curiosity and a strong sense of ownership shape how we work and how we build the systems our customers rely on.

Inference is becoming the backbone of AI products. The teams that win won’t just have the best models; they’ll also have the best systems to run them, especially as open-source models continue to advance and reach parity with closed alternatives. That’s the problem we’re obsessed with at Friendli: world-class infrastructure that gives developers the freedom to innovate.

To stay up to date on the latest from FriendliAI, follow them here.

Read our past few Deep Dives below:

If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.
