Lambda's 1-Click Clusters is your on-demand GPU multi-node superpower 🔋

Plus: Robert Brooks IV, VP of Revenue, on Lambda's AI growth...

CV Deep Dive

Today, we’re talking with Robert Brooks IV, Founding Team & VP of Revenue at Lambda

Lambda is a leading GPU Cloud platform for AI training, fine-tuning and inference workloads. Founded in 2012 at the dawn of the deep learning revolution, the company has grown into the ML community’s favorite GPU Cloud. Lambda has amassed hundreds of thousands of users on its cloud, including large enterprise customers such as Microsoft, Rakuten & Intuitive Surgical. Lambda’s GPU Cloud is also immensely popular with AI startups, serving companies like Anyscale, Writer.com, Pika & Covariant.

This week, Lambda announced 1-Click Clusters, the world’s first on-demand, self-serve GPU clusters in the cloud. 1-Click Clusters are rentable for weeks instead of months, with no long-term contract required. The goal of 1-Click Clusters is to enable teams to leverage Lambda’s multi-thousand-GPU cloud clusters for rapid training and fine-tuning of large AI models at a moment’s notice, without needing to go through a traditional sales process involving multiple meetings and begging for ‘street pricing’.

1-Click Clusters has already seen widespread adoption among Lambda’s existing customers during its pre-general-availability phase, with strong interest from companies working on multimodal AI and drug discovery, to name a few.

In this conversation, Robert walks us through how Lambda is revolutionizing GPU cloud compute for Machine Learning Engineers & Researchers across AI startups & the enterprise, and why 1-Click Clusters is the superpower that AI teams have been waiting for. 

Let’s dive in ⚡️

Read time: 8 mins

Our Chat with Robert 💬

Robert - welcome to Cerebral Valley! Excited to have you here. First off, tell us a bit about your background and what led you to join Lambda in 2018?

Hey there! I’m Robert, and I lead all-things revenue at Lambda. I was part of Lambda’s founding team, which is really just a fancy title to say that I was one of the first ten employees at Lambda. I joined in 2018 when deep learning was all the rage, especially after what happened with AlphaGo and DeepMind. They put out a documentary, I watched it, and immediately knew I had to be a part of this space.

I'm non-technical, so I lead all aspects of revenue, including marketing, sales, customer success, revenue operations, and partnerships. Joining a machine learning/deep learning company in 2018 as a non-technical person was essentially unheard of - but I decided to join after meeting Stephen, our CEO. I realized this wasn't just a compute company focused on expanding revenue or sales. These were machine learning engineers and researchers who cared about giving back to the ML community and building software and hardware that promoted the growth of deep learning and training large models.

Keep in mind, large models back then are small models today - the transformer was just beginning to be widely discussed in the ML community right around the time I hopped aboard. Since joining, my focus has been on revenue, and we've grown from a couple million dollars when I joined to almost a billion.

Lambda is obviously a leading name in the world of GPU cloud platforms. How would you describe Lambda to a new AI team focusing on training or fine-tuning?

Lambda is a GPU cloud for AI workloads, mainly training, fine-tuning and inference. Nowadays, there are many startups claiming to do the same, and I love that. However, we've been around since 2012, so we have 12 years of experience, big numbers, several hundred employees, and offices across the United States.

When machine learning teams evaluate clouds, it's crucial to consider that the big hyperscalers cater to every single workload across web, mobile, and GPUs - AI is certainly part of their footprint, but they are not specialized. There are also tier-2 GPU clouds that we compete with, but they still take a horizontal approach and cater to VFX, rendering, computational science, and other GPU-related tasks. This makes their software stacks and support teams horizontal, resulting in less of a pure-play focus on supporting AI workloads.

Lambda is unique because we have chosen to focus exclusively on building a GPU cloud for AI training and inference. We are a niche within a niche, which allows us to provide specialized support and optimized software stacks tailored specifically for AI workloads.

Walk us through the arc of Lambda’s early focus on machine learning, all the way to today’s generative AI revolution.

When I joined, we were really focusing on convolutional neural networks, GANs, and LSTMs. The concept of a large language model didn't exist yet, and NLP was just starting to become a focus. We've seen a shift from focusing on bounding boxes to dealing with context lengths in the millions of tokens, which has been fun to experience.

Back then, we engaged with large companies—some of the top insurance, financial, manufacturing, and aerospace companies in the world—who were just hiring their first machine learning engineers or researchers and procuring Lambda workstations, servers & clusters for AI training. We started by building on-premise clusters for customers because procuring a GPU in the cloud back then meant using a K80 or maybe a V100, which was extremely expensive compared to owning one. Also, those older cloud GPUs were not as performant as some of the gamer cards available in 2017/2018, most notably the 1080 Ti.

As time has gone on, we've seen the importance of interconnectivity with GPUs like the A100, H100, and B200, especially in the era of large language models. 

These early machine learning teams had to start not only with GPU computation but also with understanding the end applications for their businesses. They often didn't know what the final application would be, but they had to have a thesis. It's really inspiring to see large enterprises now that are AI-native and have teams of hundreds of machine learning engineers and researchers. 

How has enterprise adoption shifted since you started at Lambda? Has the advent of generative AI pushed more enterprises to approach Lambda for GPU capacity? 

I would say true enterprise adoption beyond the mega hyperscalers like Microsoft and Meta is still lagging. We have a legacy on-premise business closely tied to traditional enterprise sales with companies like Sony, Samsung, Intuitive Surgical, and Airbus. The on-premise side of our business has a strong connection to these traditional enterprises that took AI seriously back in 2018.

Comparing that to the growth of our cloud business, we're seeing much wider adoption from companies like Rakuten, the AI Institute, Pika and Anyscale. These companies aren’t thinking about GPUs in the tens - they’re thinking in terms of hundreds or thousands. I think enterprises do still need to catch up in this regard.

We actually work with many companies that are training new neural network architecture models, not just relying on fine-tuning open-source models and transformers. They're trying to create novel architectures, and this is where we see a lot of interesting use cases. When I speak to large financial services companies, they are very comfortable adopting an open-source model and fine-tuning it on their data. They don't necessarily want to pre-train and start from scratch.

I wouldn't have predicted this - initially, I thought that training from scratch and having proprietary models, while still owning all the data, would be the focus. But it seems fine-tuning is more appealing to many enterprises. However, I do expect enterprises to start looking at pre-training much more in the future.

Let’s dive into 1-Click Clusters, Lambda’s latest GPU product. This is something you’ve been working on for a while - what should enterprise ML teams be most excited about here? 

Absolutely - so, mid-market and enterprise are really where Lambda started with our on-premise business and where Lambda is going with our cloud. Ultimately, we cater to enterprises, large machine learning teams, and well-funded AI startups that need clustered GPU solutions.

There are two main options today: reserved contracts that span 1-3 years, and instant contracts for short-term needs, which we call 1-Click Clusters. Over the last 18 months, we've built the business with several hundreds of millions of dollars worth of contracts sold, building out what we call a Reserved GPU Cloud offering. Essentially, these are single-tenant H100 + InfiniBand clusters that range from 2,000 to 12,000 interconnected GPUs. The primary workloads include pre-training from scratch and fine-tuning large open-source models, with big YOLO runs that can span 30 to 90 days.

Inference is also a key workload for our Reserved GPU Cloud, but only for the largest AI labs that can afford running inference at this scale. These clusters are not custom; they are strictly InfiniBand plus H100. They include petabytes of storage from some of the predominant storage providers. Typically, there’s a lead time of about 90 days from order to build.

What we're solving for ML teams is the concept of 1-Click Clusters. We want to give machine learning teams and enterprises the ability to self-serve into datacenter-scale AI compute without speaking to a human. This means an on-demand H100 plus InfiniBand cluster that can be spun up in hours and rented for just weeks. For example, many AI startups that have raised massive Series A rounds have to wait months to prove out their thesis. 1-Click Clusters allows them to prove their thesis in hours without having to spend $5 to $10 million to reserve a large contract for one to three years. Most importantly, 1-Click Clusters are available today, right now, at this very minute! They are truly on-demand with no hoops to jump through like with hyperscalers.
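For readers who haven't run multi-node jobs before, here is a minimal, hypothetical sketch of the kind of smoke test a team might run right after a cluster like this spins up, using PyTorch's standard torchrun launcher with NCCL. The node count, rendezvous endpoint, and script name are illustrative assumptions, not Lambda-specific tooling:

```python
# Hypothetical multi-node smoke test: verify every GPU can reach every other
# GPU over the interconnect via a collective all-reduce. Launched on each
# node with PyTorch's standard launcher, e.g. (values are illustrative):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#       --rdzv_backend=c10d --rdzv_endpoint=<head-node-ip>:29500 \
#       smoke_test.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT
    # for each process; init_process_group reads them by default.
    dist.init_process_group(backend="nccl")  # NCCL uses the fast interconnect
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor of ones; after the all-reduce, every
    # rank should hold a tensor filled with the world size.
    x = torch.ones(1024, device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    world_size = dist.get_world_size()
    assert torch.allclose(x, torch.full_like(x, world_size))
    if dist.get_rank() == 0:
        print(f"all_reduce OK across {world_size} GPUs")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

From there, a real training job would typically swap the all-reduce for a model wrapped in DistributedDataParallel or FSDP, launched with the same pattern scaled up to the full cluster.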

We believe this will revolutionize machine learning teams’ training and fine-tuning capabilities, helping them to get to market faster and not having to shell out 30% to 50% of their seed or Series A funding to prove out their thesis.

This absolutely sounds like it’s going to appeal to startups and teams who need to prototype quickly. Any interesting customers or verticals that you think this might be most impactful for? 

We've been in a pre-general availability phase for the last 30 days with this product. Given that we have well over 5,000 customers on our cloud, we've been working with existing users without having to do any significant outbound marketing.

We had some initial theses around AI startups in specific verticals like image and video, as well as drug discovery, and these theses have been proven correct - for example, we've seen a ton of AI drug discovery companies latch onto this product immediately. They need 512 GPUs, all interconnected for four weeks, to prove their model, and then come back to either do a reserved contract with Lambda to further develop their model or go back to the on-demand cloud for inference or fine-tuning. That’s been extremely validating.

What we maybe didn't expect was the enterprise demand, even from financial services and healthcare companies, which traditionally have sensitive data. The ability to find large blocks of interconnected GPUs in the market that run with high uptime is rare - the key word here is "uptime." The fact that Lambda has been around for a long time and wasn't just a crypto company that pivoted to AI in the last couple of years, or a company just starting to deploy its first few hundred GPUs, gives us credibility. We have an established brand within the machine learning community and existing relationships with enterprises. This allows companies to understand who we are, trust us, and make that bet.

As we go to market, we're definitely going to be targeting AI-native enterprises, as well as AI startups, with this product.

There’s a ton of attention on the workload splits between training and inference - as one of the leading providers of compute, could you give us Lambda’s perspective on whether one is more of a priority than the other? 

It’s a really hard question to answer because it changes by the minute. Essentially, we have a very large on-demand cloud with tens of thousands of NVIDIA GPUs - instances range from single H100s to our 1-Click Clusters, which include up to 512 H100s connected with InfiniBand.

In the world of 1x and 8x GPU instances, there's a lot of multi-GPU training, fine-tuning, and what we call "scrappy inference" - running without high SLAs or several nines of uptime. The majority of these workloads focus on fine-tuning. There's also a high propensity for scrappy inference on our on-demand cloud: using it without paying for a service level agreement, not banking on several nines of uptime and redundancy, and still being able to serve generative AI applications owing to the cloud's reliability.

We have an inference platform announcement coming soon—nothing to talk about today, but that’s where we see a lot of focus. Speaking of inference, we have an adage here: if you ask ten machine learning companies what training is, you'll get roughly the same response from all ten. If you ask ten companies what inference is, you'll get very different responses. A lot more to say on inference in the very near future.

Regarding our 1-Click Clusters product and our traditional reserved cloud product, we have deployments of contiguous clusters with anywhere between 4,000 to 12,000 GPUs under a large InfiniBand network. This is primarily where we see a lot of training focus, including training from scratch and pre-training.

Talk to us a little bit about what differentiates Lambda from other players in the space. What’s unique about the team and the business you’ve built? 

I want to start with the fact that we've been around since 2012. We were founded by machine learning engineers and have published machine learning research at the largest academic conferences in the world, including NeurIPS, ICCV, and SIGGRAPH. Our founders built a computer vision application that was doing generative AI before generative AI was even a known term - they used convolutional neural networks and GANs to create a surrealistic image by combining a photo of yourself with a painting by Dali, etc. This was back in 2016, when having an application on your iPhone that used neural networks to generate images was mind-blowing.

We survived as a business because we built our own GPU cluster out of a bunch of 1080 Tis. If we had kept running that application on a large hyperscaler cloud, we would have run out of money, and Lambda wouldn't exist today. Building our own cluster allowed us to understand how difficult AI infrastructure is for ML teams. Greg Brockman often talks about the orchestration of GPUs for large-scale training being one of the hardest problems to solve, and that's Lambda's business. That's what we do every single day.

Our core focus on serving machine learning, without getting sidetracked by trends like crypto, computational science, VFX, or rendering, has allowed us to make customers really happy with our infrastructure dedicated to AI.

Lastly, give us three things you’re most excited about, either internally at Lambda or more broadly around the GenAI space.

Absolutely. So I'll start with Lambda and then end with the broader AI space.

At Lambda, our 1-Click Clusters product has never been done before - it's completely differentiated and is a superpower for machine learning teams. The ability to self-serve into data center-scale AI clusters without talking to a human being, without being gouged on price, and being able to rent for just a couple of weeks is something that's never existed in the machine learning community. I'm excited to see the adoption of this product once we start publicly revealing it.

In the broader AI space, we're seeing a nice resurgence in humanoid robotics. We work with a lot of these companies and know what they’ve been doing for quite some time. The fact that we're getting closer to having a $10,000 robot in your house that can fold your laundry is super exciting, and I'm grateful that Lambda has played a role in that too. 

I also want to stress a couple of ideas around where things are going in the machine learning community, especially around open source. Lambda is actively seeding a lot of chips to companies that are training open-source models and giving back to the ML community. You might not see it often, but if you're in my Twitter DMs working on something extremely novel, we will support it. We're doing this not just with entry-level GPUs but with high-end models like the GH200, which cost a lot of money, and we're providing clusters of them. We're even working on the H200 right now.

This is what we’re trying to do: giving back to the machine learning community and making sure that any AI startup, even if they’re “GPU poor,” has the ability to produce something without being bottlenecked by GPU availability across other clouds.

Anything else you’d like people to know about 1-Click Clusters? 

The last thing I’ll say is that we're at a point where we deploy roughly $100 million worth of AI infrastructure into our cloud every couple of months, and this is super exciting because it's providing liquidity to our cloud. Our cloud is known on Twitter for being sold out a lot - when we make $100 million worth of GPUs available, it typically gets taken up in about ten business days. 

Now, we’re starting to deploy in much larger chunks—several hundreds of millions of dollars worth of GPUs in our cloud. We're aiming to reach a point where you don't show up to Lambda's cloud and see a sold-out notification. There will be GPU liquidity for quite some time - 1-Click Clusters is one representation of that. We've built our brand on 1x and 8x single instances, but there's a massive influx of GPUs coming right now and through the latter half of this year.

Conclusion

To stay up to date on the latest with Lambda, follow them on X and learn more about them at Lambda.

If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email or DM us on Twitter or LinkedIn.