By Ben Lorica.
As much as I like talking and writing about machine learning and AI, I am equally keen to point out there are also many impressive[1] startups in the data engineering and data infrastructure (DE) category. DE companies address fundamentals that need to be in place before companies can rely on reports and metrics. Any organization wishing to scale its use of AI and machine learning also needs DE tools. In fact, almost all the tools in the buzzy category of MLOps assume that users already have their DE act together.
We’ll start seeing more mainstream appreciation for DE now that machine learning researchers are rallying around “data-centric AI”, a collection of tools and techniques for cleaning, augmenting, and enhancing datasets to improve the accuracy of ML models. Setting aside this renewed interest in data among ML researchers, practitioners (ML engineers and other data professionals) have long known that focusing on data is more important than focusing on modeling. Numerous surveys through the years have shown that data teams spend the majority of their time gathering, cleaning, and enhancing data sets.
This post is the first of what I hope will be a series focusing on segments of the DE ecosystem. I chose to kick things off with tools for data integration (the process of combining data from multiple sources to provide a unified view) and tools for building and managing data pipelines. Data integration makes data more readily available, easier to consume, and easier to process by other systems. Data pipelines involve moving data from a source or multiple sources into a data management system.
I’ll draw on the following sources: job postings, LinkedIn profiles, and startup databases. Together, they give us a deeper understanding of the demand and supply sides of the data integration labor market, as well as the startups providing the next-generation solutions.
Demand (Job Postings)
Who is looking to hire data integration talent? The chart below looks at recent postings in nine U.S. technology hubs. I identified close to 6,000 job postings that mentioned at least one of the following terms: “data integration”, “data transformation”, or “ETL”. Top consulting firms dominated the list of leading job posters. I’ve spoken to many people who work at these consulting firms, and data integration/unification is essential for many of the projects they do. A few technology companies (Amazon, Apple, IBM, and Salesforce) also made the list[2]:
The lack of standard job titles makes online job postings messy and somewhat tedious to work with. After cleaning up job titles from these postings, I was able to group them by specific roles. Hiring managers are focused on recruiting skilled data engineers and analysts. One in six (~16%) of all such postings sought data engineers; one in seven (~15%) were looking for analysts. Data integration projects are more likely to be handled by analysts than by data scientists or machine learning professionals.
Sought-after skills include areas that pertain to data quality, analytics, data warehousing, and data pipelines. Looking at job descriptions, some phrases jump out: interest in [“data quality”, “data analysis”, “data analytics”] is consistent with results from our 2022 Data Engineering Survey where data quality was listed as a key challenge. Another group of phrases [“data warehouse”, “data modeling”, “data architecture”] hints that many data integration projects pertain to data needed for reporting and analytics (this includes data that flows into modern data platforms – cloud warehouses and lakehouses).
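The filtering and grouping steps described above can be sketched roughly as follows. This is a minimal illustration and not the method actually used to produce the charts: only the three search terms come from the text, while the sample postings, the `ROLE_PATTERNS` rules, and the function names are hypothetical.

```python
import re
from collections import Counter

# The three search terms come from the post; everything else is illustrative.
SEARCH_TERMS = ("data integration", "data transformation", "etl")

# Hypothetical title-normalization rules, checked in order (first match wins).
ROLE_PATTERNS = [
    ("data engineer", re.compile(r"\bdata\s+engineer", re.I)),
    ("analyst", re.compile(r"\banalyst\b", re.I)),
    ("data scientist", re.compile(r"\bdata\s+scientist", re.I)),
]

# Whole-word, case-insensitive match on any of the search terms.
_TERM_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in SEARCH_TERMS) + r")\b", re.I
)


def mentions_search_term(description: str) -> bool:
    """True if the job description mentions at least one search term."""
    return _TERM_RE.search(description) is not None


def normalize_title(title: str) -> str:
    """Map a messy job title onto a coarse role, or 'other' if nothing matches."""
    for role, pattern in ROLE_PATTERNS:
        if pattern.search(title):
            return role
    return "other"


def role_shares(postings):
    """Share of matching postings per normalized role."""
    matched = [p for p in postings if mentions_search_term(p["description"])]
    counts = Counter(normalize_title(p["title"]) for p in matched)
    total = len(matched)
    return {role: n / total for role, n in counts.items()} if total else {}


# Made-up sample data, for illustration only.
SAMPLE_POSTINGS = [
    {"title": "Sr. Data Engineer II", "description": "Build ETL pipelines."},
    {"title": "Business Analyst (Remote)", "description": "Own data integration projects."},
    {"title": "Data Engineer", "description": "Data transformation at scale."},
    {"title": "Frontend Developer", "description": "Build React components."},
]

if __name__ == "__main__":
    print(role_shares(SAMPLE_POSTINGS))
```

Real postings would need far more normalization rules than the three shown here, which is why this step is tedious in practice.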
Supply (Existing Talent Pool)
Using Diffbot’s knowledge graph, I was able to identify 300,000+ individuals[3] who listed one of these key skills – “data integration”, “data transformation”, or “ETL” – on their LinkedIn profile. Recall from Figure 2 that top consulting firms dominated the list of employers looking to hire data integration talent. The chart below lists Software Consulting (19K+) and Management Consulting (13K+) firms among the top sectors where data integration talent is currently employed:
I took the global talent pool for data integration (300,000+ people) and looked at their current job roles. The lack of normalized job titles makes individual profiles even more[4] tedious to work with compared to job postings. In comparison to Figure 2 (roles used in job postings), “data engineer” is lower down this list.
Looking at the skills listed by data integration talent, reporting and analytics jump out (“data warehouse”, “SQL”, “business intelligence”, “data modeling”). As for specific tools, it’s not surprising that one finds older systems (Oracle, SQL Server, PL/SQL, Informatica) cited at a higher frequency. As noted in Figure 4, the list of people cuts across geographic regions, industries, and company size. Many of these older solutions simply have more users than recent systems and open source projects associated with modern data platforms.
Solution Providers (startup ecosystem)
With the current variety of options available, it is an excellent time to search for tools and solutions that simplify and modernize data integration. Several of these solutions address requirements that users expect from their modern data tools, such as scale, speed, fault-tolerance, connectivity to/from many external systems and data types, as well as security, privacy, and governance.
There are data integration tools aimed at different personas including engineers and developers, analysts and data scientists, and domain experts who aren’t coders (“low-code/no-code” solutions). Many of the tools that target coders increasingly include features that inject software engineering rigor into how teams build, deploy, and manage data pipelines. Depending on the vendor, these may include tools for running (unit/integration/functional) tests, and tools that make it easier to schedule, deploy, monitor, and manage pipelines at scale.
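To make the idea of software engineering rigor a little more concrete, here is a minimal sketch of a pipeline transformation written as a pure function so that it can be unit tested in isolation. The record fields, the `clean_records` function, and the test are hypothetical and not tied to any vendor's tool.

```python
def clean_records(records):
    """Drop rows without an id, trim names, and de-duplicate by id (first wins)."""
    seen = set()
    cleaned = []
    for row in records:
        row_id = row.get("id")
        if row_id is None or row_id in seen:
            continue  # skip rows with no id and repeated ids
        seen.add(row_id)
        cleaned.append({"id": row_id, "name": (row.get("name") or "").strip()})
    return cleaned


def test_clean_records_drops_missing_ids_and_duplicates():
    # Made-up input covering the three cases the function handles.
    raw = [
        {"id": 1, "name": "  Ada "},
        {"id": None, "name": "ghost"},
        {"id": 1, "name": "Ada again"},
        {"id": 2, "name": None},
    ]
    assert clean_records(raw) == [
        {"id": 1, "name": "Ada"},
        {"id": 2, "name": ""},
    ]
```

Because the function takes plain records and returns plain records, a test runner such as pytest can exercise it without a live source system or warehouse.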
Data integration remains a very active and exciting area. Some of the best solutions come from companies that offer broad platforms (e.g., Databricks[5] has outstanding offerings in this area). The graphic below is limited to startups that concentrate on data integration, data pipelines, and related areas. With little effort, I was able to identify thirty startups and organize them using a simple taxonomy:
- General Purpose: Tools that aim to help companies build and maintain pipelines that move data from/to a variety of sources and (data management) systems. Data transformation capabilities vary: ETL tools transform data on a separate system before loading; ELT tools transfer raw data, and data transformation is usually handled by the destination system (data warehouse/lakehouse).
- Focused on connecting to popular SaaS and other apps: As companies increasingly gravitate towards best-of-breed systems, their data unification challenges have gotten more complex. Startups in this category focus on simplifying data integration problems that involve many of the most popular software-as-a-service platforms. Customer Data Platforms are another common offering of startups in this category.
- Domain or industry specific: These include startups that focus on data integration solutions for marketing, finance, or healthcare.
- Orchestration: At a high level, these are tools that enable developers to write, schedule, and monitor pipelines. There are several well-funded startups founded by creators of popular open source projects. The truth is, earlier orchestration tools were hard to use and weren’t reliable at scale. Thus, this category is currently a hotbed for innovation – new tools are on the way that will simplify and rethink how companies build data and machine learning pipelines. As an example, Prefect just announced a compelling version 2.0 engine last week.
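The ETL/ELT distinction mentioned under General Purpose tools can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for the destination warehouse; the table names, the sample rows, and the functions `run_etl`, `run_elt`, and `load_both` are hypothetical and chosen only for illustration.

```python
import sqlite3

# Made-up source rows: amounts arrive as untidy strings, one of them empty.
RAW_ORDERS = [("a-1", " 19.99 "), ("a-2", "5.00"), ("a-3", "")]


def run_etl(conn):
    """ETL: transform in application code first, then load only clean rows."""
    clean = [(oid, float(amt)) for oid, amt in RAW_ORDERS if amt.strip()]
    conn.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", clean)


def run_elt(conn):
    """ELT: load raw rows as-is, then transform with SQL inside the destination."""
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount
        FROM raw_orders
        WHERE TRIM(amount) <> ''
        """
    )


def load_both():
    """Run both styles against one in-memory database and return their results."""
    conn = sqlite3.connect(":memory:")
    run_etl(conn)
    run_elt(conn)
    query = "SELECT order_id, amount FROM {} ORDER BY order_id"
    etl = conn.execute(query.format("orders_etl")).fetchall()
    elt = conn.execute(query.format("orders_elt")).fetchall()
    conn.close()
    return etl, elt
```

Both paths end with the same clean table. The difference is where the work happens: in the ELT variant the untouched `raw_orders` table stays in the destination, so the transformation can be revised and re-run there without going back to the source.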
Data unification challenges have gotten more complex as companies gravitate towards best-of-breed cloud services. Fortunately, companies no longer need to write custom pipelines to extract, load, and transform data from popular SaaS platforms. Data teams now have access to data integration solutions that are simpler, cheaper, and easier to use and manage. Data integration is not the only component being modernized: data engineering teams are now able to build modular data platforms composed of best-in-class tools (what many refer to as a “modern data stack”).
Machine learning and AI introduce new data challenges to platform teams accustomed to working mainly with structured data or text. Data platform teams will need to consider more types of data and sources as computer vision and speech applications become easier to deploy. Looking ahead, as multimodal machine learning and BI become more common, users will need data platforms that can handle different data types.
I plan to take a deeper look into some of the categories listed in Figure 6 above. I also plan to invite entrepreneurs and developers focused on data integration onto the Data Exchange podcast (suggestions are welcome). Subscribe to our newsletter to get the latest updates. Download this FREE report to learn more about the results of our recent Data Engineering Survey:
- Free Report: 2022 Data Engineering Survey Results
- Free Report: 2022 Trends in Data, Machine Learning, and AI
- Data Quality Unpacked
- Data Management Trends You Need to Know
- Taking Low-Code and No-Code Development to the Next Level
- Most State-Of-The-Art AI Systems Are Trained With Extra Data
[1] “Impressive”, based on revenue.
[2] I omitted a few IT staffing firms from this top 13 list.
[3] This is based on LinkedIn profiles, so China is likely to be underrepresented.
[4] Job titles used in job posts are likely to undergo some vetting from HR teams. Job titles used to create this chart are from LinkedIn profiles, and are much less standardized and much harder to work with.
[5] I am an advisor to Databricks and other interesting startups.