Today, we announced support for 300-node clusters as a measure of our ability to scale CockroachDB. That number puts us a bit ahead of our largest customer deployments, and it's the first step in our three-year scale strategy, which in the coming year (2026) will take us to 500 nodes.

AI is, unquestionably, driving database scale, and we're doubling down on testing the biggest clusters we can in order to give our customers the confidence they need to deploy AI at scale.

Other vendors talk about "unlimited scale". With an architecture that's horizontally scalable, where every node can both read and write, that claim is an easy cop-out: need more capacity? Just add more nodes. Yet while none of us really reveal the size of our largest customer deployments, there are hints. Shopify, for example, talks about their 160-node cluster as being "if not the largest, then one of the largest Yugabyte deployments" out there.

History Lesson

To talk about these scale tests, I first want to take you back about two years. A little before my time at Cockroach Labs.

Step 1: Product Quality

We're a data-driven company, and around 2023 we noticed that engineering was handling a lot of disruptive escalations. Engineering-led escalations are important and useful, but not ideal for keeping engineers in the flow of their own work. They impacted productivity (and probably morale).

We took a look at what was happening and began a product quality initiative. We were going to improve the quality of CockroachDB as measured by the number of escalations that reach engineering.

We tracked that metric for about a year and saw great results. Escalations are few, and engineering teams are humming along. And, of course, customers are the beneficiaries.

Quality really is the foundation of what every company should deliver, and it should never be far from top of mind. The last thing you want to do is take a customer down for 10 hours without explanation. That's a reputation killer no vendor can afford.

Step 2: Performance

Following our quality push, we decided to focus on performance. After benchmarking against the usual suspects, we realized we had some work to do. Again, it was a year-long push with improvements noticeable from the get-go, and in 2025 we delivered an overall performance improvement of more than 100%. This is critical because it doesn't just mean you can get more performance out of any given configuration; it also gives customers a better total cost of ownership profile.

For those customers unfamiliar with this work, 25.2 delivered about a 50% performance improvement over the prior release, and 25.4 brought the cumulative improvement to over 100%. That should give you a sense of which versions to upgrade to if you want to capture these gains. It's also why it's frustrating to see Yugabyte run performance comparisons against a four-year-old version of CockroachDB and call themselves the performance leader. For the record, we outperformed them even before the 50% gain in 25.2.

Step 3: Core Product Promise – Resilience

Once the performance work was well under way and we had the 'good bones' of quality and performance to build on, we wanted a way to measure our core product promise: resilience.

We launched an initiative to do just that: measure Performance under Adversity (PuA). The idea is simple, and the linked blog post explains it in detail, so I won't repeat much here. In short, we created a test environment and ran our benchmarks while ALSO subjecting the cluster to six types of adverse events on top of normal operations (a rough sketch of that kind of setup follows the list). The six are:

  1. Operational distress – online backups, online schema changes, online software updates, changefeeds, etc. By the way, the ability to perform these operations without impacting foreground traffic is due to a capability we've delivered called Admission Control. Not all vendors have invested in admission control, and the results show it.
  2. Disk stalls – common in cloud environments but not easily dealt with. These prove to be a source of dramatic latency variance in other offerings.
  3. Network failures – both asymmetric and symmetric.
  4. Node failures.
  5. Zone failures.
  6. Regional failures.
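
To make that concrete, here's a minimal sketch, in Go, of what a harness along these lines might look like: a foreground workload measuring per-query latency while a background goroutine periodically triggers one of the online "operational distress" events (here, a full backup). The connection string, backup URI, and timings are placeholders of my own, not the actual PuA harness or its workload.

```go
// Hypothetical "performance under adversity" style loop: measure
// foreground query latency while an online, adverse operation runs
// in the background. All names and URIs below are placeholders.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func main() {
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Background "operational distress": kick off an online full backup
	// once a minute while the foreground workload keeps running.
	go func() {
		ticker := time.NewTicker(time.Minute)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if _, err := db.ExecContext(ctx,
					`BACKUP INTO 'nodelocal://1/pua-backups'`); err != nil {
					log.Printf("backup error: %v", err)
				}
			}
		}
	}()

	// Foreground workload: trivial point queries, logging per-query latency.
	for ctx.Err() == nil {
		start := time.Now()
		var now time.Time
		if err := db.QueryRowContext(ctx, `SELECT now()`).Scan(&now); err != nil {
			log.Printf("query error: %v", err)
			continue
		}
		log.Printf("foreground latency: %s", time.Since(start))
		time.Sleep(100 * time.Millisecond)
	}
}
```

The dimension the real tests care about is exactly what this sketch logs: how foreground latency behaves while the adverse event is in flight.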

And, again, it wasn't just about measuring but also about learning. Once we could measure resilience, we had a reliable way to improve it. And if you go to the PuA dashboard, you'll see the dramatic improvements from 25.1 to 25.4, especially in latency through each of the failure scenarios.

Delivering Scalability

Step 4: Core Product Promise – Scale

Through 2025 we institutionalized the delivery of PuA metrics as part of the release cycle. We run this testing with each release and share the results. We run the same tests against competitors to make sure we're world-class, but to date we haven't shared those results publicly (except for Oracle).

‼️
You can't deliver scale without first delivering on the core promise of resilience. Scale that's not survivable isn't really the goal.

AI is, unquestionably, driving database scale. But here's the thing, and it's really important: you can't deliver scale without first delivering on the core promise of resilience.

Each of the things we've discussed – quality, performance, resilience – is a basic building block. They must precede any discussion of scale, because without resilience (or quality, or performance), who cares how well a solution scales?

And that's why announcing support for 300-node clusters is interesting: the scale testing we're doing incorporates PuA stressors. In fact, we performed an online schema change across 120B rows while running the 300-node cluster at full capacity. We ran full and incremental backups. We collected log information for troubleshooting.
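
For readers who haven't run these operations themselves, here's a hedged sketch of what statements like those look like when issued from a Go client against CockroachDB. The table name, column, connection string, and backup URI are illustrative placeholders, not the objects or locations used in the 300-node test.

```go
// Hypothetical examples of the online operations described above, issued
// from a Go client. Table, column, DSN, and backup URI are placeholders.
package main

import (
	"database/sql"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func main() {
	db, err := sql.Open("pgx", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		// Online schema change: the backfill runs in the background
		// without blocking reads or writes on the table.
		`ALTER TABLE orders ADD COLUMN loyalty_tier STRING DEFAULT 'none'`,
		// Full backup to external storage, taken while the cluster is live.
		`BACKUP INTO 'nodelocal://1/scale-test-backups'`,
		// Incremental backup appended to the most recent full backup.
		`BACKUP INTO LATEST IN 'nodelocal://1/scale-test-backups'`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Printf("statement failed: %q: %v", s, err)
		}
	}
}
```

The statements themselves are routine; the point of the scale test is that they complete while the cluster keeps serving foreground traffic at full capacity.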

What we've done is deep scale testing: not just of the product itself (how far does the database scale?) but of the things around it that often don't scale with the product, like support and maintenance operations.

And we're just getting started. We've publicly stated that our goal for this year is 500 nodes. And we have other metrics we're exploring, like queries per second and more.

These scale tests are expensive to run, require smart people who could, in all honesty, be doing other things, and take commitment up and down the management chain to execute with the same quality we put into everything else we do here at Cockroach Labs.

[editor's note: I use em-dashes but promise this was not written by AI.]