The Data Pegacorns

Introducing the Data $100M Revenue Club

By Kenn So and Ben Lorica.

The global economy has deteriorated since our last post on AI pegacorns (startups that have at least $100M in annual revenue). The focus has shifted from valuation alone to a rapid revalidation of the importance of revenue scale. As a continuation of our pegacorn[1] series, we’re launching the Data Pegacorn list: data engineering companies founded in or after 2006 that have reached or exceeded $100M in annual revenue. We chose 2006 as the cutoff year because it coincides with the initial release of Hadoop, the open-source technology platform that started the big data era.

Figure 1: Membership criteria.


Figure 2: The Data Pegacorn Club. (List of companies is available below.)


How to contribute

We’d love for you to add any companies we missed! You can either submit using the form below, or submit a pull request to this public GitHub repo where we maintain the list of data pegacorn companies.

Pegacorn Candidates


  • Open-source companies are the minority overall but constitute the majority of data management vendors, particularly those that focus on processing and storage at scale (e.g., Databricks, Redis) as opposed to those that focus on other features (e.g., Airtable for usability and Scale for labeling). At the data storage and access layer, open-source is prevalent because data is the lifeblood of a company and open-source reduces lock-in. Many of today’s most exciting data companies began as open-source projects incubated within web-scale companies.
  • Data governance is the second largest category after data management. After storing data, companies want to make sure that they get value out of it as well as stay compliant with regulations. Data catalogs like Alation grew out of the need to make sense of all the data stored. OneTrust grew out of the need to comply with privacy rules and consumer preferences. 
  • Data is such a dynamic space that every few years “next-gen” solutions come along: from scalable computing with Hadoop, to in-memory databases like Redis, to cloud warehouses like Snowflake (which isn’t included in the list because it is a public company), to lakehouses like Databricks. This is different from the application SaaS space, where market leaders like Microsoft, Salesforce, and Adobe can retain their positions for decades. In data there’s always something new, driven by the need for scale, changing user needs, new applications, or regulations.
    1. While Airtable (founded 2012) was initially built for end users, its widespread use has made it attractive to application developers and data engineers.
    2. Privacy regulations led to OneTrust (founded 2016) growing to $420 million in annual revenue.
    3. The need for quality data to train machine learning models accelerated Scale’s (founded 2017) growth to >$100 million.
  • We looked at the founders’ LinkedIn profiles to see what skills they list. Unsurprisingly, cloud computing, distributed systems, and data management are among the top. What’s notable is the mention of Hadoop, which kickstarted the big data wave.
Figure 3: Top skills listed on profiles of founders of Data pegacorn companies. Data via Diffbot.

Kenn So is an investor at Shasta Ventures, an early-stage VC, and was previously a data scientist. Opinions expressed here are solely his own.

Ben Lorica helps organize the Data+AI Summit and the Ray Summit, is co-chair of the NLP Summit, and principal at Gradient Flow. He is an advisor to Databricks and other startups.

Appendix: List of Companies

[Note: This is a snapshot as of 2022-06-06. For the most up-to-date list of data pegacorn companies, see our GitHub page.]


  • 🇺🇸 Cohesity (“A modern approach to cyber resilience for hybrid and multicloud environments.”)
  • 🇺🇸 Druva (“Your Data. Always Safe. Always Ready.”)
  • 🇺🇸 OwnBackup (“#1 SaaS Data Protection Platform”)

Data Management

  • 🇺🇸 Airtable (“Connect everything. Achieve anything.”)
  • 🇺🇸 Databricks (“All your data, analytics and AI on one platform”)
  • 🇺🇸 DataStax (“The Open Stack for Modern Data Apps”)
  • 🇺🇸🇸🇪 neo4j (“Blazing-Fast Graph. Petabyte Scale.”)
  • 🇺🇸 Qumulo (“One File Data Platform For All Your Unstructured Data”)
  • 🇺🇸🇮🇹 Redis (“The world’s most loved real-time data platform”)
  • 🇺🇸 Scale (“Better Data. Better AI.”)

Data Integration

  • 🇺🇸 Databricks (“All your data, analytics and AI on one platform”)
  • 🇺🇸 Fivetran (“Centralize your data in minutes not months”)
  • 🇺🇸 Tealium (“The Most Trusted Customer Data Platform”)


Data Governance

  • 🇺🇸 Alation (“Data Intelligence + Human Brilliance”)
  • 🇧🇪 Collibra (“Process should be intuitive, not infuriating”)
  • 🇺🇸 OneTrust (“Trust Is More Than Just a Compliance Box to Check”)

Hardware storage

  • 🇮🇱 Infinidat (“… devoted to helping our clients compete more effectively in the petabyte era”)
  • 🇺🇸 VAST Data (“Simplicity at scale”)
Figure 4: From unicorns to pegacorns.




[1] A Pegacorn is a flying (winged) unicorn.
