
VESSL AI - Building the Future of AI Agents and Scalable MLOps 🌐

Plus: CEO Jaeman on VESSL's vision for the future of compound AI systems...

CV Deep Dive

Today, we’re talking with Jaeman An, Co-Founder and CEO of VESSL AI.

VESSL is a multi-cloud ML infrastructure platform designed to simplify and optimize machine learning workflows. By providing a unified interface for managing on-premise and cloud-based GPU resources, VESSL AI enables companies to seamlessly train, fine-tune, and deploy models across different environments. The platform helps AI teams scale efficiently with features like a unified YAML interface and robust reliability tooling, including integrations with third-party platforms such as GitHub, Hugging Face, and Slack, as well as support for on-premise setups.

VESSL AI is already empowering AI startups and enterprises, from LLM training use cases to automated ML pipelines for companies like Hyundai Motors. By focusing on reliability and scalability, VESSL AI is making complex AI infrastructure more accessible and effective for developers around the world.

In this conversation, Jaeman shares the story behind VESSL AI, the challenges of building a robust MLOps platform, and their vision for the future of compound AI systems.

Let’s dive in ⚡️

Read time: 8 mins

Our Chat with Jaeman 💬

Jaeman - welcome to Cerebral Valley! First off, give us a bit about your background and what led you to co-found VESSL AI? 

Hey there - I’m Jaeman An, CEO and co-founder of VESSL AI. I previously worked as a software engineer for about seven to eight years. During that time, I built a mobile game that scaled to 10 million users, which involved dealing with global traffic and creating cloud architectures and infrastructure to manage massive loads. My experience was centered around building cloud infrastructure and DevOps solutions.

After that, I transitioned to an AI startup in the medical field, where I developed a deep learning application for predicting acute diseases in hospital patients. We monitored vital signs like blood pressure and used that data to forecast critical conditions, such as heart attacks. Through this work, I realized there needed to be a system similar to DevOps, but for AI—an MLOps system to facilitate continuous integration and delivery of model training and deployment.

That insight led me to start my own company. At VESSL AI, we focus on building MLOps infrastructure. Our first major task was orchestrating and grouping GPU resources across multiple cloud platforms and on-premise setups, enabling scalable model training and deployment. We also expanded to training and deploying large language models (LLMs) for production environments. That's a quick overview of our journey so far.

How would you describe VESSL AI to an AI engineer or developer who isn’t as familiar?

With VESSL AI, users can connect their own cloud or on-premise GPUs. We provide a simple YAML interface for running any kind of machine learning task, from training to deployment. It’s just one YAML interface for everything. We also have a web interface that allows users to easily manage training and deployment, including production features like monitoring, auto-scaling, and other reliability tools—all through a single YAML configuration.
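
To make this concrete, here’s a minimal sketch of what a single-YAML run definition along these lines could look like. The field names and values below are illustrative assumptions for this write-up, not VESSL AI’s documented schema.

```yaml
# Hypothetical run spec -- field names are illustrative, not VESSL AI's documented schema.
name: llm-finetune-demo
resources:
  cluster: my-onprem-cluster          # could just as well point at a connected cloud account
  gpus: 4
image: nvcr.io/nvidia/pytorch:24.01-py3
volumes:
  /data: s3://my-bucket/training-data # assumed mount syntax
run:
  - pip install -r requirements.txt
  - python train.py --epochs 3 --output /artifacts/model
```

The idea, as Jaeman describes it, is that the same file format covers training, fine-tuning, and deployment, with the platform deciding where the workload actually lands.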

Talk to us about your users today - who’s finding the most value in what you’re building with VESSL AI? 

When we started our company, most of our users were AI developers training their own models, often for tasks like image classification and vision models. Since last year, though, we’ve seen a shift, and now many users are using our platform to train and fine-tune their own LLMs. They typically want to utilize their on-premise GPUs while also scaling their GPU services across multiple cloud platforms, like Lambda, Coreweave, and major providers like AWS and Google Cloud. Our platform allows them to merge all these GPU resources into one unified dashboard, making it easy to run LLM training and fine-tuning across multiple clouds and regions.

Any customer or design-partner success stories you’d like to share? 

I can share two of our success stories, which are quite different from each other. First, I want to mention an LLM startup called Ketolab. They use our platform to fine-tune their own foundation models. Their mission involves providing B2B LLM services like Cohere does, which requires them to tailor their foundation models to fit their customers' private data. This means they have to run hundreds or even thousands of fine-tuning jobs in a single week. By using our YAML interface, they can easily manage and scale all their GPU resources without worrying about the underlying infrastructure, enabling them to conduct these extensive fine-tuning experiments efficiently.

The second use case is from the enterprise sector, involving Hyundai Motors. Their case is quite different. They’re building autonomous driving models and need to process data collected daily. To simplify their ML workflow, from data processing and labeling to model training and deployment, they use our YAML interface. They define each step of their ML workflow—data preprocessing, model training, deployment, and so on—in YAML, and then connect these steps to create an automated pipeline. This approach has significantly reduced the complexity of their ML operations and allowed them to build an efficient automated pipeline for their autonomous models.
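
As a rough illustration of the step-chaining described above, a pipeline definition in this style might look something like the sketch below. The steps/depends_on structure is a hypothetical example, not Hyundai’s actual configuration or VESSL AI’s exact schema.

```yaml
# Hypothetical pipeline spec -- structure and keys are illustrative only.
name: autonomous-driving-daily
steps:
  - name: preprocess
    run:
      - python preprocess.py --input /raw --output /processed
  - name: train
    depends_on: [preprocess]
    resources:
      gpus: 8
    run:
      - python train.py --data /processed --output /artifacts/model
  - name: deploy
    depends_on: [train]
    run:
      - python deploy.py --model /artifacts/model
```

In this sketch, each step is the same kind of YAML-defined workload as a standalone run; the pipeline layer simply wires their inputs and outputs together into an automated flow.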

Compute has become a hugely critical part of generative AI, and has received a lot of interest in the past 18 months. What sets VESSL AI apart from others in the space? 

One of the key differences with our platform is that, unlike most MLOps or GPU services, which require users to log into a specific cloud provider or use that provider’s infrastructure, we support a variety of infrastructures. Many AI startups and enterprises prefer to use their own infrastructure, which includes on-premise GPUs, private clouds, and alternative cloud platforms like Oracle Cloud Infrastructure. These platforms often provide cheaper GPU services compared to major providers like AWS or Google Cloud, but they lack robust features, making it difficult to build and scale AI infrastructure for production.

Our platform addresses this by seamlessly connecting to multiple cloud platforms and on-premise GPU resources through our proprietary integration technology, which optimizes connectivity and management across different environments. This enables a robust production layer for ML workflows. Essentially, our core difference lies in our ability to integrate with multiple third-party cloud platforms while providing scalable and reliable AI infrastructure.

There's a lot of excitement around agents and multimodal models in addition to the growth of LLMs. How does serving those different markets factor into your product vision for VESSL AI?

For deploying LLMs, our main focus has been on reliability and scalability. We make sure that LLM deployment is stable, fast, and robust. We provide essential features like model monitoring, auto-scaling, and other reliability measures. So, if customers want to build their own AI infrastructure for adaptive deployment or deploy LLMs with strong scalability, our platform stands out as one of the best options for that.
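
For a sense of what those reliability knobs could look like in the same single-YAML style, here is a hypothetical serving spec. The autoscaling and monitoring keys are assumptions made for illustration, not VESSL AI’s documented deployment schema.

```yaml
# Hypothetical serving spec -- autoscaling/monitoring keys are assumptions, not documented schema.
name: llm-endpoint
model: /artifacts/model
resources:
  gpus: 1
autoscaling:
  min_replicas: 1
  max_replicas: 8
  target_gpu_utilization: 70   # scale out when average utilization exceeds 70%
monitoring:
  metrics: [latency_p95, tokens_per_second]
  alerts:
    - channel: slack           # the interview mentions a Slack integration
      when: error_rate > 0.01
```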

What has been the hardest technical challenge around building VESSL AI into the product it is today?

As I mentioned, we need to connect multiple cloud platforms and regions, and we also have to schedule mixed workloads across these platforms while considering pricing, networking, and storage factors. One of the major challenges we've faced is building a unified data layer that works seamlessly across various cloud platforms, including on-premise GPU resources. This involves optimizing scheduling, storage, and network performance across multiple cloud environments and integrating on-premise infrastructure efficiently.

How do you plan on VESSL AI progressing over the next 6-12 months? Anything specific on your roadmap that new or existing customers should be excited for? 

In the next six months or so, we're planning to strengthen our integrations with various cloud platforms and third-party tools, especially LLMOps tools like LangChain, LlamaIndex, Pinecone, and VBA. We're also focused on building MLOps platforms tailored for specific cloud environments, such as MLOps for Oracle and MLOps for Lambda. Improving our positioning with developers in this area is one of our key priorities. Additionally, we're working on creating agent workflows that integrate MLOps with LLMOps workflows, which will be a major area of focus for us moving forward.

Lastly, tell us a little bit about the team and culture at VESSL AI. How big is the company now, and what do you look for in prospective team members that are joining?

Our team is split between Seoul and San Francisco, with a total of 30 members, 20 of whom are engineers. Many of our engineers have experience from major companies like Google and successful unicorn startups such as Sendbird and HyperConnect. They've previously worked on building global-scale cloud infrastructure and handling massive amounts of traffic. That's a brief introduction to our team.

Anything else you want people to know about VESSL AI?

The last thing I want to mention is our vision: building compound AI systems. We believe AGI will emerge not from a single large model, but from an ecosystem of hundreds of thousands of interconnected models. That's why we're dedicated to developing an ML platform capable of running these layered models efficiently on users' own infrastructure. Our focus is on making these models reliable at scale and then integrating agent functionality to enhance this ecosystem further. That's our ultimate goal.

Conclusion

To stay up to date on the latest with VESSL, learn more about them here.

If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.