DataChain is AI's advanced data infra platform
Plus: Founder/CEO Dmitry on computer vision, LLMs and unstructured data...

CV Deep Dive
Today, we're talking with Dmitry Petrov, Founder and CEO of DataChain.
DataChain is an advanced data infrastructure platform designed for AI teams to manage, curate, and process large-scale unstructured data. DataChain emerged from Dmitry's experience in data management and his success with the open-source project DVC (Data Version Control). The company's mission is to provide AI developers with the tools needed to build and maintain high-quality datasets for training models and developing LLM applications, addressing the complex challenges of handling vast amounts of unstructured data.
Today, DataChain is used by teams working on AI models and LLM applications, particularly those dealing with computer vision, LLMs, and other data-intensive tasks. The platform supports users in organizing, querying, and updating massive datasets efficiently, making it a valuable tool for companies looking to scale their AI operations. Since its launch, DataChain has attracted a growing community of users and contributors, and its open-source data frame library has become a critical component of its success.
In this conversation, Dmitry takes us through the founding story of DataChain, the unique challenges of building a data infrastructure platform for AI, and the company's roadmap for the next 12 months.
Let's dive in ⚡️
Read time: 8 mins
Our Chat with Dmitry 💬
Dmitry - welcome to Cerebral Valley! First off, tell us a bit about your background and what led you to found DataChain?
Hey there! My name is Dmitry, and I've been working with data for quite a while. My PhD focused on data storage and management, and I was a data scientist at Microsoft Bing for several years. When you're in that role, most of your time is spent on feature engineering for modeling or analytical tasks, and this led to the creation of DVC (Data Version Control). I started the open-source project DVC to track models back to the data, especially raw data in storage, where you don't have databases behind it. The project became popular, so I founded a company around it. Now, as CEO, I'm working with customers on the roadmap, the product itself, and keeping a big focus on data.
Our goal is to figure out how to work with data on your S3 drive or other unstructured storage, how to connect data with models, and how to ensure reproducibility. That's how we arrived at DataChain. We realized DVC does a good job tracking data and connecting data to models and data sources in storage, but a bigger problem emerged. Using old terminology, feature engineering became a bigger challenge. The question now is, "I have millions of files, documents, images, or videos; which ones do I need for my model? Which do I need for training?"
DVC couldn't answer that question, because answering it requires annotations, running inference, and making decisions based on that inference, like including or excluding data. Now, with LLMs, you can take a file, throw it to something like ChatGPT, and ask, "How many people are in this image?" If there are more than 20 people or no people at all, maybe you don't need that image for your human face detection algorithms. People are using a lot of different tools these days for what we used to call feature engineering, but now it's more about data curation. That's how DataChain got started.
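To make that curation step concrete, here is a minimal sketch of the idea using OpenAI's chat completions API with an image input. The model name, prompt, and the 20-person threshold are illustrative assumptions, not DataChain's implementation.

```python
# Minimal sketch: ask a vision-capable LLM how many people an image contains,
# then decide whether to keep it for a face-detection training set.
# Model name and threshold are illustrative; a real pipeline would also
# validate that the reply parses as an integer.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def count_people(image_url: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "How many people are in this image? Reply with a single integer."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return int(response.choices[0].message.content.strip())

def keep_for_face_detection(image_url: str) -> bool:
    n = count_people(image_url)
    return 0 < n <= 20  # drop empty scenes and dense crowds, per the example above
```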
Give us a top-level overview of DataChain - how would you describe the startup to those who are less familiar with you?
Think of it like a data frame we've designed specifically for working with files and AI data structures. You start by getting files and building a kind of virtual table of these files. For example, you might have millions of files stored in S3, and in your table, you keep records of these files with details like checksums and other metadata.
Then, you can take those files and run them through a model like Mistral. You get responses and store those responses right next to the corresponding file in your table. So for each file, you know what the response was, how many tokens were used, and so on.
From there, you can keep going deeper, adding more layers of analysis. With each layer, you introduce more structure, making your data frame more valuable. You start with just a reference to the file, then add the Mistral response, then maybe a number indicating how many people are in the image, a boolean value for whether it's night or day, a vector for similarity search, and so on. You keep adding more data and structure, stage by stage, until you have a richly detailed data frame.
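To make the layering concrete, here is a rough sketch written against the open-source DataChain library. The bucket path and enrichment functions are stand-ins, and while the chaining style follows the library's documentation, treat the exact signatures as assumptions rather than a definitive recipe.

```python
from datachain import DataChain

# Stand-in enrichment steps; in practice each would call a real model
# (e.g. Mistral for the description, an embedding model for the vector).
def describe(file) -> str:
    return "two people walking a dog at night"  # hypothetical LLM response

def count_people(response: str) -> int:
    return 2  # hypothetical parser over the LLM response

def night_flag(response: str) -> bool:
    return "night" in response

chain = (
    DataChain.from_storage("s3://my-bucket/images/")  # layer 0: file refs, checksums, metadata
    .map(response=describe)        # layer 1: the raw model response, stored next to each file
    .map(num_people=count_people)  # layer 2: a number extracted from that response
    .map(is_night=night_flag)      # layer 3: a boolean flag
    .save("curated-images")        # persist the enriched, versioned data frame
)
```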
Who are your users today? Who's finding the most value in what you're building with DataChain?
We're mainly targeting teams that are either building AI models and need high-quality data for training or those developing LLM applications. However, not every team will benefit from our tool right away. Teams need to reach a certain level of maturity first, where they're focused on cleaning and curating their data, as well as evaluating the results of their training or applications. That's when we can provide the most value by helping them prepare high-quality, curated data.
How do you measure the impact that DataChain is having on your key customers? Any customer success stories that you'd like to share?
We started with computer vision a bit before the LLM boom, and even today, a significant portion of our clients are still focused on images and videos. It begins with a simple question: how do you keep track of 300 million annotated files? That's the first problem. You can't just fit that into a regular data frame; it won't fit, and your machine doesn't have enough memory. If you put it in a data warehouse, you'd have to deal with SQL and other complexities, which isn't easy for AI engineers. That's where DataChain can help. It can build you a table or data frame that lives out of memory, handling something like 300 million files.
You can query based on certain attributes, like needing files only from a specific directory or with a particular file pattern, say, by the date of creation. For example, you might need only files from the last month. Once you have this high-scale dataset, the next question is how to update it. Imagine you have a directory with files, whether it's thousands or just 25. The question becomes, "What's changed since we last built and shipped our application?" Maybe something changed in the bucket: did we get any new files? Were any files updated? You want to know exactly what happened.
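Here is a sketch of that kind of query, again against the open-source library. The bucket name is made up, and the column names follow DataChain's documented file schema (file.path, file.last_modified), which you should treat as an assumption.

```python
from datetime import datetime, timedelta, timezone
from datachain import Column, DataChain

one_month_ago = datetime.now(timezone.utc) - timedelta(days=30)

recent = (
    DataChain.from_storage("s3://my-bucket/")
    .filter(Column("file.path").glob("training/*.jpg"))    # specific directory + file pattern
    .filter(Column("file.last_modified") > one_month_ago)  # only files from the last month
)
print(recent.count())  # evaluated lazily, without loading every file into memory
```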
So, you need a way to diff basic file storage, especially in S3, to see that maybe only 20 files were added, or perhaps 30,000 images were created. Then, you can apply what we call data updates or data transformations. For instance, one of our customers spent over a week of compute time on a powerful GPU machine just to score their data and compute embeddings for images. It was a massive resource investment. Then, they received an update with a few thousand new images. The question is, do you want to waste another week of GPU time rescoring the entire bucket? Probably not. You want to know what specifically changed in the bucket.
When you identify the couple thousand new or changed images, you can compute the vectors only for that new data and then create a dataset that combines the new and old files, along with their embeddings. Instead of wasting a week of GPU compute time, you might spend just half an hour on a delta update and then union everything together.
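Here is a hedged sketch of that delta update. The subtract/union calls mirror the library's chaining style, but the exact names and signatures, along with the embed_image stand-in and the dataset name, are assumptions for illustration.

```python
from datachain import DataChain

def embed_image(file) -> list[float]:
    return [0.0] * 512  # stand-in for the expensive GPU embedding call

old = DataChain.from_dataset("image-embeddings")      # the already-computed dataset

delta = (
    DataChain.from_storage("s3://my-bucket/images/")  # fresh listing of the bucket
    .subtract(old, on=["file.path"])                  # keep only files not processed before
    .map(embedding=embed_image)                       # embed just the few thousand new files
)

old.union(delta).save("image-embeddings")             # new version: old rows plus the delta
```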
Open-source has been a huge part of your work with DataChain. Could you describe how critical open-source has been to your work thus far?
DataChain actually started as a SaaS product on customer sites rather than as an open-source project. This is a big difference from DVC, which was initially created as open-source, and the SaaS version came about four years later. With DataChain, we did the opposite, and that approach has its own pros and cons.
We decided to open-source one crucial part of DataChain: the data frame library. This library handles how to define data frames, how to serialize them, how to work with storage, file references, and objects like LLM responses. The idea is that you can get an LLM response and serialize it without worrying about SQL or other complexities. It also includes logic for handling changes in a bucket and deciding what to do next with the data.
This open-source functionality works on your local machine with a local embedded database. The goal is to help people curate their data, understand what's inside it, and extract insights from their files. For us, as a company, open-sourcing this library is essential for getting feedback from the community. Since we open-sourced DataChain just two weeks ago, we've already received two pull requests from external contributors and some ongoing discussions, which is really valuable information.
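As one illustration of that "serialize an LLM response without SQL" idea: DataChain's docs describe Pydantic-based data models, so a response object can be returned from a map step and stored next to the file. The DataModel usage, field names, and the ask_llm helper below are assumptions sketched for illustration, not the library's definitive API.

```python
from datachain import DataChain, DataModel

class LLMResponse(DataModel):  # Pydantic-style model the library can serialize
    text: str
    tokens_used: int

def ask_llm(file) -> LLMResponse:
    # Stand-in for a real call to ChatGPT/Mistral on the file's contents
    return LLMResponse(text="no people in this image", tokens_used=42)

chain = (
    DataChain.from_storage("s3://my-bucket/docs/")
    .map(response=ask_llm)        # nested object stored next to each file
    .save("docs-with-responses")  # flattens to fields like response.text, response.tokens_used
)
```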
The community feedback helps drive the product in the right direction. We want to lead in the areas of data curation, data analytics, and evaluation, particularly in the context of files and unstructured data. When it comes to structured data, there's already a well-established field with a beautiful landscape of tools: data warehouses, ETL, BI, visualization tools, statistical techniques, and so on. It's a great world of data management and analysis. The challenge now is to transfer that knowledge and those skills to people working with LLMs and computer vision.
People are starting to realize the importance of this because LLM evaluation is becoming one of the toughest challenges. The first time you put an LLM application in front of customers, it's exciting and cool, but soon you realize you need more, whether it's more safety or accuracy. For example, you don't want your LLM to mistakenly sell a brand-new car for $1,000, which is something that actually happened a few months ago. So, evaluation is key to delivering AI that's reliable and safe for everyone.
How do you plan on DataChain progressing over the next 6-12 months? Anything specific on your roadmap that new or existing customers should be excited about?
We just open-sourced a part of DataChain, and now we're really focused on user use cases: figuring out what people are picking up the most and which use cases we should concentrate on, especially in LLM applications. One area that's evolving quickly is the evaluation stage of the stack. How are people conducting evaluations, especially at a larger scale? Performance is also a big issue, which is common in data infrastructure. We're providing the infrastructure, not the smart algorithms: a way to orchestrate all the pieces together to cover more use cases.
Getting more feedback from users and addressing more use cases is a top priority. We're focusing on the flexibility of the tool because we're hearing more and more from users about where they hit limitations with the current workflow. LLM evaluation is a bit unique; actually, it's new, and that's the challenge. In this new workflow, you need to try different approaches and see what works. That's why community input is so crucial. We need broad feedback from dozens or hundreds of users each month to understand which direction to prioritize.
Lastly, tell us a little bit about the team and culture at DataChain. How big is the company now, and what do you look for in prospective team members who are joining?
First of all, we hire for passion. People need to be genuinely excited about this area because what makes our work both challenging and fun is that we operate at the intersection of infrastructure and AI. On the infrastructure side, you might not always know exactly what's needed from an AI perspective, so you need strong feedback from AI folks to really understand how things work. Your engineering intuition might lead you in the wrong direction because AI/ML folks often have a different mindset. If you've worked with both sides, you know this well.
At the same time, building this tool purely with AI folks isn't possible either because it's deeply infrastructure-heavy, requiring a strong understanding of deep tech. But this mix of the two sides is what makes it really fun: you're always learning something new, whether it's about infrastructure use cases or AI problem sets. That's a crucial part of our team's dynamic and what drives us.
Open source and being a distributed team are also big parts of our DNA. Our company has been distributed from day one. It's fun because you get to work with people from different cultural backgrounds and time zones. Sure, time zones can be a challenge, but it's something we deal with. It's exciting to be at the crossroads of two different worlds and to experience the best of both sides.
Conclusion
To stay up to date on the latest with DataChain, learn more about them here.
If you would like us to "Deep Dive" a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.