
DataChain is AI's advanced data infra platform 🔋

Plus: Cofounder/CEO Dmitry on CV, LLMs and unstructured data...

CV Deep Dive

Today, we’re talking with Dmitry Petrov, Founder and CEO of DataChain.

DataChain is an advanced data infrastructure platform designed for AI teams to manage, curate, and process large-scale unstructured data. DataChain emerged from Dmitry’s experience in data management and his success with the open-source project DVC (Data Version Control). The company's mission is to provide AI developers with the tools needed to build and maintain high-quality datasets for training models and developing LLM applications, addressing the complex challenges of handling vast amounts of unstructured data.

Today, DataChain is used by teams working on AI models and LLM applications, particularly those dealing with computer vision, LLMs, and other data-intensive tasks. The platform supports users in organizing, querying, and updating massive datasets efficiently, making it a valuable tool for companies looking to scale their AI operations. Since its launch, DataChain has attracted a growing community of users and contributors, and its open-source data frame library has become a critical component of its success.

In this conversation, Dmitry takes us through the founding story of DataChain, the unique challenges of building a data infrastructure platform for AI, and the company’s roadmap for the next 12 months.

Let’s dive in ⚡️

Read time: 8 mins

Our Chat with Dmitry 💬

Dmitry - welcome to Cerebral Valley! First off, give us a bit about your background and what led you to found DataChain?

Hey there! My name is Dmitry, and I’ve been working with data for quite a while. My PhD focused on data storage and management, and I was a data scientist at Microsoft Bing for several years. When you’re in that role, most of your time is spent on feature engineering for modeling or analytical tasks, and this led to the creation of DVC (Data Version Control). I started the open-source project DVC to track models back to the data, especially raw data in storage, where you don’t have databases behind it. The project became popular, so I founded a company around it. Now, as CEO, I’m working with customers on the roadmap, the product itself, and keeping a big focus on data.

Our goal is to figure out how to work with data on your S3 drive or other unstructured storage, how to connect data with models, and how to ensure reproducibility. That’s how we arrived at DataChain. We realized DVC does a good job tracking data and connecting data to models and data sources in storage, but a bigger problem emerged. Using old terminology, feature engineering became a bigger challenge. The question now is, “I have millions of files, documents, images, or videos—which ones do I need for my model? Which do I need for training?”

DVC couldn’t answer that question because it requires annotations, running inference, and making decisions based on that inference, like including or excluding data. Now, with LLMs, you can take a file, throw it to something like ChatGPT, and ask, “How many people are in this image?” If there are more than 20 people or no people at all, maybe you don’t need that image for your human face detection algorithms. People are using a lot of different tools these days for what we used to call feature engineering, but now it’s more about data curation. That’s how DataChain got started.

Give us a top-level overview of DataChain - how would you describe the startup to those who are maybe less familiar with you?

Think of it like a data frame we've designed specifically for working with files and AI data structures. You start by getting files and building a kind of virtual table of these files. For example, you might have millions of files stored in S3, and in your table, you keep records of these files with details like checksums and other metadata.

Then, you can take those files and run them through a model like Mistral. You get responses and store those responses right next to the corresponding file in your table. So for each file, you know what the response was, how many tokens were used, and so on. 

From there, you can keep going deeper, adding more layers of analysis. With each layer, you introduce more structure, making your data frame more valuable. You start with just a reference to the file, then add the Mistral response, then maybe a number indicating how many people are in the image, a boolean value for whether it's night or day, a vector for similarity search, and so on. You keep adding more data and structure, stage by stage, until you have a richly detailed data frame.
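The stage-by-stage layering Dmitry describes can be sketched in plain Python. To be clear, the names below (`FileRecord`, `enrich`) are illustrative, not DataChain's actual API, and the model call is faked:

```python
from dataclasses import dataclass
from typing import Optional

# One record per file in the "virtual table": it starts as a bare
# reference plus checksum, then gains structure with each stage.
@dataclass
class FileRecord:
    path: str                             # e.g. an S3 key
    checksum: str                         # identifies the file version
    llm_response: Optional[str] = None    # stage 2: raw model output
    num_people: Optional[int] = None      # stage 3: parsed count
    is_night: Optional[bool] = None       # stage 4: boolean flag
    embedding: Optional[list] = None      # stage 5: vector for similarity search

def enrich(record: FileRecord) -> FileRecord:
    # Stand-in for a real model call (e.g. to Mistral); hard-coded here.
    record.llm_response = f"analysis of {record.path}"
    record.num_people = 3
    record.is_night = False
    record.embedding = [0.1, 0.2, 0.3]
    return record

table = [FileRecord("s3://bucket/img001.jpg", "abc123")]
table = [enrich(r) for r in table]
```

Each enrichment pass adds a column-like field next to the file reference, which is what makes the data frame progressively more valuable for filtering and training decisions.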

Who are your users today? Who’s finding the most value in what you’re building with DataChain?

We’re mainly targeting teams that are either building AI models and need high-quality data for training or those developing LLM applications. However, not every team will benefit from our tool right away. Teams need to reach a certain level of maturity first—where they’re focused on cleaning and curating their data, as well as evaluating the results of their training or applications. That’s when we can provide the most value by helping you prepare high-quality, curated data.

How do you measure the impact that DataChain is having on your key customers? Any customer success stories that you’d like to share?

We started with computer vision a bit before the LLM boom, and even today, a significant portion of our clients are still focused on images and videos. It begins with a simple question: how do you keep track of 300 million annotated files? That’s the first problem. You can’t just fit that into a regular data frame—it won’t fit, and your machine doesn’t have enough memory. If you put it in a data warehouse, you’d have to deal with SQL and other complexities, which isn’t easy for AI engineers. That’s where DataChain can help. It can build you a table or data frame that works out of core, handling something like 300 million files without loading them all into memory.

You can query based on certain attributes, like needing files only from a specific directory or with a particular file pattern—say, by the date of creation. For example, you might need only files from the last month. Once you have this high-scale dataset, the next question is how to update it. Imagine you have a directory with files—whether it's thousands or just 25 files—the question becomes, "What’s changed since we last built and shipped our application?" Maybe something changed in the bucket—did we get any new files? Were any files updated? You want to know exactly what happened.
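That kind of attribute-based query can be illustrated with a toy listing in plain Python (the paths and dates below are made up, and real S3 metadata would come from the bucket listing rather than a hard-coded list):

```python
from datetime import date, timedelta

# Toy file listing: (path, created) pairs standing in for S3 metadata.
files = [
    ("s3://bucket/train/a.jpg", date(2024, 7, 1)),
    ("s3://bucket/train/b.png", date(2024, 8, 20)),
    ("s3://bucket/eval/c.jpg",  date(2024, 8, 25)),
    ("s3://bucket/train/d.jpg", date(2024, 8, 28)),
]

today = date(2024, 9, 1)
last_month = today - timedelta(days=30)

# Query: only .jpg files under train/ created in the last month.
selected = [
    path for path, created in files
    if path.startswith("s3://bucket/train/")
    and path.endswith(".jpg")
    and created >= last_month
]
```

The same filtering idea scales once the listing lives in an out-of-core table rather than an in-memory list.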

So, you need a way to differentiate between basic file storage operations, especially in S3, to see that maybe only 20 files were added, or perhaps 30,000 images were created. Then, you can apply what we call data updates or data transformations. For instance, one of our customers spent over a week of compute time on a powerful GPU machine just to score their data and compute embeddings for images. It was a massive resource investment. Then, they received an update with a few thousand new images. The question is, do you want to waste another week of GPU time rescoring the entire bucket? Probably not. You want to know what specifically changed in the bucket.

When you identify the couple thousand new or changed images, you can compute the vectors only for that new data and then create a dataset that combines the new and old files, along with their embeddings. Instead of wasting a week of GPU compute time, you might spend just half an hour on a delta update and then union everything together.
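The delta-update pattern is essentially a checksum diff followed by a union. A minimal sketch, with a fake `embed` function standing in for the expensive GPU step:

```python
# Checksums from the last build versus the current bucket listing.
previous = {"img1.jpg": "aaa", "img2.jpg": "bbb"}
current  = {"img1.jpg": "aaa", "img2.jpg": "ccc", "img3.jpg": "ddd"}

# Only files that are new, or whose checksum changed, need re-scoring.
delta = {
    path: checksum
    for path, checksum in current.items()
    if previous.get(path) != checksum
}

def embed(path):
    # Stand-in for an expensive GPU embedding computation.
    return [len(path)]

# Reuse old embeddings for unchanged files; compute only the delta,
# then union the two into the new dataset.
old_embeddings = {"img1.jpg": [9], "img2.jpg": [9]}
new_embeddings = {path: embed(path) for path in delta}
dataset = {**{p: old_embeddings[p] for p in current if p not in delta},
           **new_embeddings}
```

Here only `img2.jpg` (changed) and `img3.jpg` (new) get re-embedded, which is the half-hour delta update instead of the week-long full rescore.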

Open-source has been a huge part of your work with DataChain. Could you describe the critical nature of open-source to your work thus far? 

DataChain actually started as a SaaS product on customer sites rather than as an open-source project. This is a big difference from DVC, which was initially created as open-source, and the SaaS version came about four years later. With DataChain, we did the opposite, and that approach has its own pros and cons.

We decided to open-source one crucial part of DataChain: the data frame library. This library handles how to define data frames, how to serialize them, how to work with storage, file references, and objects like LLM responses. The idea is that you can get an LLM response and serialize it without worrying about SQL or other complexities. It also includes logic for handling changes in a bucket and deciding what to do next with the data.
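The "serialize an LLM response next to its file reference, without writing SQL" idea can be sketched with stdlib dataclasses and JSON. The field names here are hypothetical, chosen only to illustrate the shape of such a record:

```python
import json
from dataclasses import dataclass, asdict

# A response object stored next to its file reference, so no SQL
# schema has to be written by hand.
@dataclass
class LLMResponse:
    file: str          # reference back into object storage
    model: str
    answer: str
    tokens_used: int

row = LLMResponse("s3://bucket/doc.pdf", "mistral-small", "2 people", 57)
serialized = json.dumps(asdict(row))              # what gets persisted
restored = LLMResponse(**json.loads(serialized))  # round-trips cleanly
```

The library's job is to make this round-trip automatic for arbitrary response objects, so users think in terms of typed records rather than tables and columns.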

This open-source functionality works on your local machine with a local embedded database. The goal is to help people curate their data, understand what’s inside it, and extract insights from their files. For us, as a company, open-sourcing this library is essential for getting feedback from the community. Since we open-sourced DataChain just two weeks ago, we’ve already received two pull requests from external contributors and some ongoing discussions, which is really valuable information.

The community feedback helps drive the product in the right direction. We want to lead in the areas of data curation, data analytics, and evaluation, particularly in the context of files and unstructured data. When it comes to structured data, there’s already a well-established field with a beautiful landscape of tools—data warehouses, ETL, BI, visualization tools, statistical techniques, and so on. It’s a great world of data management and analysis. The challenge now is to transfer that knowledge and those skills to people working with LLMs and computer vision.

People are starting to realize the importance of this because LLM evaluation is becoming one of the toughest challenges. The first time you put an LLM application in front of customers, it’s exciting and cool, but soon you realize you need more—whether it’s more safety or accuracy. For example, you don’t want your LLM to mistakenly sell a brand new car for $1,000, which is something that actually happened a few months ago. So, evaluation is key to delivering AI that’s reliable and safe for everyone.

How do you plan on DataChain progressing over the next 6-12 months? Anything specific on your roadmap that new or existing customers should be excited for?

We just open-sourced a part of DataChain, and now we're really focused on user use cases—figuring out what people are picking up the most and which use cases we should concentrate on, especially in LLM applications. One area that’s evolving quickly is the evaluation stage of the stack. How are people conducting evaluations, especially at a larger scale? Performance is also a big issue, which is common in data infrastructure. We’re providing the infrastructure, not the smart algorithms, but a way to orchestrate all the pieces together to cover more use cases.

Getting more feedback from users and addressing more use cases is a top priority. We’re focusing on the flexibility of the tool because we’re hearing more and more from users about where they hit limitations with the current workflow. LLM evaluation is a bit unique—actually, it’s new, and that’s the challenge. In this new workflow, you need to try different approaches and see what works. That’s why community input is so crucial. We need broad feedback from dozens or hundreds of users each month to understand which direction to prioritize.

Lastly, tell us a little bit about the team and culture at DataChain. How big is the company now, and what do you look for in prospective team members who are joining?

First of all, we hire for passion. People need to be genuinely excited about this area because what makes our work both challenging and fun is that we operate at the intersection of infrastructure and AI. On the infrastructure side, you might not always know exactly what's needed from an AI perspective, so you need strong feedback from AI folks to really understand how things work. Your engineering intuition might lead you in the wrong direction because AI/ML folks often have a different mindset. If you’ve worked with both sides, you know this well.

At the same time, building this tool purely with AI folks isn’t possible either because it’s deeply infrastructure-heavy, requiring a strong understanding of deep tech. But this mix of the two sides is what makes it really fun—you’re always learning something new, whether it’s about infrastructure use cases or AI problem sets. That’s a crucial part of our team’s dynamic and what drives us.

Open source and being a distributed team are also big parts of our DNA. Our company has been distributed from day one. It’s fun because you get to work with people from different cultural backgrounds and time zones. Sure, time zones can be a challenge, but it’s something we deal with. It’s exciting to be at the crossroads of two different worlds and to experience the best of both sides.

Conclusion

To stay up to date on the latest with DataChain, learn more about them here.  


If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.