
Remyx is your ExperimentOps Platform for AI Development 🧪

Plus: Co-founders Salma Mayorquin and Terry Rodriguez on transforming AI research discovery, automating paper-to-PR workflows, and bringing scientific rigor to machine learning experimentation...

CV Deep Dive

Today, we’re talking with Salma Mayorquin and Terry Rodriguez, Co-Founders of Remyx.

Remyx offers an ExperimentOps platform that serves as the intelligence layer for AI experimentation, bringing scientific rigor to AI development workflows. Founded by Salma and Terry as a team of two, it addresses the critical gap between offline AI experimentation and online production results. Built around their GitRank feature, Remyx can automatically discover relevant research papers from arXiv, generate reproducible environments, and create testable pull requests that implement cutting-edge methods directly into codebases. The platform's goal is to enable AI teams to systematically discover, test, and deploy new techniques while closing the evaluation loop with controlled online experiments.

Today, Remyx has already seen adoption from early pilots across industries like cybersecurity and aerospace, with teams using it for everything from automated paper discovery to accelerated AI development cycles. The system is designed to support teams' ongoing need for technical innovation across industries. Its integration with Statsig for A/B testing and its focus on research-to-production pipelines make it a standout solution for AI teams looking to move beyond ad hoc development practices. The platform offers a free developer version, and the team is actively seeking enterprise partnerships for in-house deployments.

In this conversation, Salma and Terry share how Remyx evolved from a development workbench to an ExperimentOps platform, the breakthrough moments that led to GitRank, and their vision for transforming how AI teams discover and implement new research at scale.

Let’s dive in ⚡️

Read time: 8 mins

Our Chat with Salma and Terry 💬

Salma and Terry - welcome back to Cerebral Valley! For readers who might be new to Remyx or need a refresher, could you give us a quick overview of what you've been up to over the last six months?

Last time, we talked about the challenges developers are currently experiencing in the AI development process, particularly experimentation and the evaluation of those experiments. When an AI developer tries a new idea and follows it through to a change in their current stack, there are lots of missing pieces: the tools that would help an engineer systematically learn from those changes aren't currently available.

At the beginning of the year, we were starting to describe some of those challenges, both ones we had experienced ourselves and ones we were observing in the field. As an industry, we still face them today: evaluating these systems, understanding exactly which changes lead to which behavior changes, and establishing a way to understand how those changes will affect users.

Six months ago you described Remyx as a development workbench for AI engineers. Now you've coined the term "ExperimentOps". How did that come about and what does ExperimentOps mean in practice?

Over the last few months there have been a lot of advances in AI-assisted coding and agent frameworks that have enabled us to go from fixed workflows like fine-tuning and data curation toward flexible code generation. We could start to describe systems that make changes to your code and help you experiment with new ideas from arXiv. As we leaned into these capabilities, we realized we could start working on parts of the problem that haven't been tackled by traditional DevOps and MLOps workflows.

We realized that experimentation is about operationalizing your learning in the context of rapidly moving technologies. In DevOps, the atomic unit is the code repository and the output is production-ready software. In MLOps, the atomic unit is the model artifact and the output is production-ready ML pipelines. And in ExperimentOps, the atomic unit is the experiment itself, and the output is knowledge: validated insights that help you stay on top of what's happening in open science and quickly adapt new techniques into production.

AI can help teams close the loop: from sourcing ideas in the research literature, to implementing them flexibly, to running controlled experiments, to analyzing results and feeding that understanding back into what to try next. Adopting an experimentation framework helps scale your ability to learn what matters when multiple factors could drive improvements.
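To make the "experiment as the atomic unit" framing a bit more concrete, here is a minimal, hypothetical sketch of what a single record in that loop might capture. The field names and values are our own illustration, not Remyx's internal schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ExperimentRecord:
    """Hypothetical atomic unit of an ExperimentOps loop: one tested idea."""
    hypothesis: str                 # e.g. "paper X's method improves our retrieval quality"
    source: str                     # where the idea came from (an arXiv ID, an issue, a hunch)
    change_ref: str                 # the code change under test (branch name or PR URL)
    offline_metrics: dict = field(default_factory=dict)  # cheap development-time eval results
    online_metrics: dict = field(default_factory=dict)   # controlled-experiment readout
    decision: str = "pending"       # ship / revert / iterate
    learned: str = ""               # the validated insight carried into the next cycle
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# One pass through the loop: source an idea, implement it, measure offline,
# validate online, then record what was learned for the next iteration.
record = ExperimentRecord(
    hypothesis="The new reranking method improves answer quality without hurting latency",
    source="arXiv:<paper-id>",                              # placeholder
    change_ref="https://github.com/<org>/<repo>/pull/<n>",  # placeholder
)
record.offline_metrics = {"eval_accuracy": 0.91, "p95_latency_ms": 340}
record.online_metrics = {"task_success_delta": 0.03, "p95_latency_delta_ms": -15}
record.decision = "ship"
record.learned = "Quality gain replicated online and latency held, so keep the method."
```

The output of the loop is the `learned` field: the validated insight, not just the artifact.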

Remyx also introduced GitRank, which automatically generates testable PRs from research papers directly into codebases. Can you walk us through how this works and what inspired you to work on this feature?

GitRank began as a way to solve our own problems with some of the open source projects we were working on. We were frustrated with the experience of discovery. When you're following influencers on social media, it seems like the same people are talking about the same hyped new papers, but there are actually hundreds of arXiv papers being published every day. So many of them you'll never discover: they're from labs you've never heard of, their repos have no stars, and nobody's talking about the work, but it's a goldmine of ideas.

The first aha moment was being able, week over week, to find specific papers that directly referenced the models we were creating in their own benchmarking. Finding those papers the day after publication, instead of weeks or months later, or maybe never, was really exciting for us. We pushed this idea further: you can go beyond recommended reading and reproducible environments to actually start implementing the core methods from these papers in your own code. GitRank combines the strengths of human expertise through open science with the strengths of LLMs in processing data and pattern matching. With GitRank, we can take context about the code and applications you're working on, match it to papers that hit arXiv every day, and implement them as draft PRs for your repos.
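To make the shape of that workflow concrete, here is a heavily simplified, hypothetical sketch of the kind of pipeline described above: pull recent papers from the public arXiv API, score them against a short description of your repo, and open a draft PR for the best match. The scoring, repo description, and helper names are our own illustration, not Remyx's implementation.

```python
import requests
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def fetch_recent_papers(query: str, max_results: int = 25) -> list[dict]:
    """Pull the newest arXiv submissions matching a query via the public Atom API."""
    url = (
        "http://export.arxiv.org/api/query"
        f"?search_query=all:{query}&sortBy=submittedDate&sortOrder=descending"
        f"&max_results={max_results}"
    )
    feed = ET.fromstring(requests.get(url, timeout=30).text)
    return [
        {
            "title": e.findtext(f"{ATOM}title", "").strip(),
            "summary": e.findtext(f"{ATOM}summary", "").strip(),
            "link": e.findtext(f"{ATOM}id", "").strip(),
        }
        for e in feed.findall(f"{ATOM}entry")
    ]

def score_against_repo(paper: dict, repo_context: str) -> float:
    """Toy relevance score: keyword overlap between the abstract and repo context.
    A real system would use embeddings plus knowledge of the code itself."""
    abstract_words = set(paper["summary"].lower().split())
    repo_words = set(repo_context.lower().split())
    return len(abstract_words & repo_words) / (len(repo_words) or 1)

def open_draft_pr(repo: str, branch: str, title: str, body: str, token: str) -> None:
    """Open a draft PR via the GitHub REST API (assumes the implementation branch is already pushed)."""
    requests.post(
        f"https://api.github.com/repos/{repo}/pulls",
        headers={"Authorization": f"token {token}"},
        json={"title": title, "body": body, "head": branch, "base": "main", "draft": True},
        timeout=30,
    )

repo_context = "retrieval augmented generation reranking latency evaluation"
papers = fetch_recent_papers("retrieval+augmented+generation")
best = max(papers, key=lambda p: score_against_repo(p, repo_context))
print(f"Candidate to implement next: {best['title']} ({best['link']})")
# open_draft_pr("org/app", "experiment/new-method", f"Try: {best['title']}", best["link"], token="...")
```

The hard part, of course, is the step this sketch elides: generating a working implementation of the paper's method inside your codebase, which is where the LLM-driven code generation comes in.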

Let's talk about your research to production pipeline. How does that ecosystem work together and what does it mean for how AI teams discover and implement new techniques?

The problem that a lot of AI and ML engineers are facing is quite research-adjacent. In the context of an arXiv paper, researchers can really only explore one factor at a time to make the analysis tractable. But in the real world you have all of these techniques to consider, and you should expect nonlinear interactions among them. Being able to help people introduce these methods, combine them, and explore the interaction effects to get to what works best for their application is super valuable.

We're thinking bigger about how this can be a value add for the entire research ecosystem. We've been talking with researchers, surveying people whose papers we've discovered, and trying to understand what would help them quantify their impact beyond citation counts: maybe thinking about impact in terms of replication counts, and understanding which industries their work is making an impact in.

Do you have any success stories of how either GitRank or the research to production pipeline has helped a team discover or implement new methods?

We've had some early adopters and early pilots. We recently talked to a cybersecurity firm that is interested in using agents to lock down data assets for their customers. They're actively trying to find new research coming out in the cybersecurity space with regard to AI adoption, so they're super excited to use the recommendations in Remyx to find new papers daily.

A lot of the feedback we've heard is that folks spend a ton of time just reproducing the environments needed to run the code referenced in research papers, so automating that alone is a huge time saver. Earlier in the year we were working on an open source project called VQAsynth, and through these paper recommendations and the ability to automatically generate environments to test the code, we expedited our ability to find the best ideas to improve our work.

In our last conversation, evaluation was one of your biggest challenges and now you've integrated with Statsig for controlled online experiments. How has this been a game changer?

Integrating with well-adopted controlled experiment platforms is our most recent addition, and it's crucial for closing the full loop from idea to production. Many teams aren't yet using these evaluation methods, but trustworthy AI evaluation remains an industry challenge. A lot of the practices being promoted right now lack the sensitivity teams need to rely on them. They're prone to overfitting on proxy metrics through optimization loops that use LLM-as-a-judge. Controlled experiments are the gold standard for the sensitivity and alignment to business outcomes and user satisfaction that teams actually need. As AI becomes critical to more user-facing applications and business processes, we anticipate more teams will adopt these methods, and we want to make it easier to bridge the gap between early exploration and rigorous testing.

Our goal is to help connect the dots across the entire lifecycle. The offline metrics you gather cheaply during development are often weak predictors of how users will respond to your application. By integrating with controlled experiment platforms, we help you track hypotheses, early experiments, and explorations all the way through to online A/B tests. This means you can finally see how your offline work relates to real-world outcomes and carry that learning forward into your next iteration. This is what makes the AI development process operationalized: capturing the full arc from "what if we tried this paper's approach?" to "here's what actually moved the needle with users."
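As a rough example of what closing that loop can look like in code, here is a minimal, hypothetical sketch using Statsig's Python server SDK. The experiment, parameter, and event names are placeholders, and the exact SDK surface should be checked against Statsig's documentation:

```python
# Hypothetical sketch: serve an experiment variant and log the outcome so the
# online readout can later be compared against offline metrics gathered during development.
from statsig import statsig, StatsigUser
from statsig.statsig_event import StatsigEvent

statsig.initialize("server-secret-key")  # server-side secret from the Statsig console

offline_metrics = {"eval_accuracy": 0.91, "p95_latency_ms": 340}  # from the offline experiment

user = StatsigUser("user-123")
experiment = statsig.get_experiment(user, "new_method_rollout")   # placeholder experiment name
use_new_method = experiment.get("use_new_method", False)          # placeholder parameter

# ... serve the request with or without the new method, then log what the user actually did ...
statsig.log_event(
    StatsigEvent(user, "task_completed", value=1, metadata={"variant": str(use_new_method)})
)

statsig.shutdown()
```

The point is the pairing: the same change that produced `offline_metrics` gets a variant assignment and an outcome event online, so the two readouts can be compared for the same hypothesis.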

You and Terry were a team of two when we last spoke. How has the team grown, and what are your priorities until the end of this year?

We're still a mighty team of two. Over the past few months, we've expanded the surface area of experimentation possibilities for developers with things like paper recommendations, GitRank, and support for robust validation.

The future of how teams learn and adapt in these rapidly moving spaces will be shaped by the community that forms around these ideas. We're hosting Experiment 2025 on October 30th in San Francisco to bring together the community of researchers, engineers, and product builders who want to move faster from ideas to production. Come discuss what operationalizing learning really means and share what's working (and what's not).

We're also looking to onboard more early pilots. If you're a company trying to accelerate your development, discovery, and validation, let's talk!

You can find us at Remyx AI, where you can try a free developer version of the entire experimentation platform. We're also on X, LinkedIn, and Substack, and we actively share resources on Hugging Face, GitHub, and Docker. Whether you're a company that wants to pilot this in-house, a developer interested in the space, or a potential open source collaborator, we're actively looking to collaborate with those who believe experimentation is the next frontier of building delightful, production-grade AI applications.

Conclusion

To stay up to date on the latest with Remyx, follow them here.


If you would like us to ‘Deep Dive’ a founder, team or product launch, please reply to this email ([email protected]) or DM us on Twitter or LinkedIn.
