Specialized Hardware for AI: Rethinking Assumptions and Implications for the Future

Exploring the Evolving Landscape of Hardware for Artificial Intelligence.

By Assaf Araki and Ben Lorica.

Specialized hardware enables artificial intelligence models to run faster and handle larger, more complex applications. Several firms, including Habana, Graphcore, Cerebras, and SambaNova, have emerged since 2015 to capitalize on this trend. In May 2016, Google introduced the first version of its Tensor Processing Unit (TPU) for inference, and Amazon Web Services (AWS) followed with Inferentia at the end of 2019 and Trainium at the end of 2020.

The market for AI semiconductors is expected to reach $76.7 billion by 2025, a compound annual growth rate of 28.2%. Data centers are the driving force behind this growth, spurring innovation among both AI hardware start-ups and hyperscalers. In fact, start-ups specializing in AI hardware for data centers have raised more than $2.5 billion since 2015:

Figure 1: A representative sample of specialized hardware for AI.
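As a rough sanity check on those growth figures, a few lines of Python show what a 28.2% CAGR implies. The base year is our assumption for illustration only; the forecast's actual baseline isn't stated here.

```python
# Back out the market size implied for an assumed base year
# (the 2019 base year is our assumption, not part of the forecast).
target = 76.7  # 2025 forecast, billions of dollars
cagr = 0.282
years = 2025 - 2019

implied_base = target / (1 + cagr) ** years
print(f"Implied 2019 market size: ${implied_base:.1f}B")  # ~$17.3B
```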

In a much-cited 2018 report, OpenAI documented the exponential increase in the amount of compute used in the largest artificial intelligence training runs since 2012. Over that period, compute doubled every 3.4 months, a dramatically shorter doubling period than the roughly two years associated with Moore’s Law. The authors speculated that hardware startups focused on AI-specific chips would drive this trend forward in the near future, and that new hardware would deliver significant increases in FLOPS/Watt and FLOPS/$ over the subsequent years.
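To put those doubling periods in perspective, a short illustrative calculation shows how differently they compound over a six-year window:

```python
# Compare six years of compute growth under OpenAI's observed
# 3.4-month doubling period versus Moore's Law's ~2-year doubling.
months = 12 * 6  # e.g., 2012 through 2018

ai_growth = 2 ** (months / 3.4)    # ~2.4 million-fold
moore_growth = 2 ** (months / 24)  # 8-fold
print(f"3.4-month doubling: ~{ai_growth:,.0f}x")
print(f"Moore's Law:        ~{moore_growth:.0f}x")
```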

Specialized Hardware for AI: A Quick Recap

We begin by re-examining some assumptions around hardware for AI. Several years ago, the industry formed a set of expectations around AI hardware accelerators; below, we revisit those assumptions and assess which ones still apply today.

Figure 2: AI Hardware – assumptions vs. reality.

The only assumption that remains valid today[1] is that training serves as a control point: hardware used for training also garners most inference workloads. As a result, training continues to attract considerable research attention and investment. Teams continue to develop newer and larger models while simultaneously investigating more efficient tools and techniques.

Industry players conduct the majority of research and development for inference workloads. A KDnuggets survey found that a relatively low percentage of models are actually deployed to production: respondents deployed less than 40% of their models on average, and a majority reported deployment rates below 20%. Optimizing and streamlining inference requires engineering and data science work to improve software performance without sacrificing accuracy. The process becomes even more complex when the inference hardware differs from the hardware used for training. While tools exist to ease this transition, it still demands a significant level of focus and effort. Rather than use different hardware for inference, most teams prefer to use the same hardware for training and inference and focus on developing better models.
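As one concrete illustration of that migration work, a common approach is exporting a trained model to a portable format such as ONNX so that a different inference backend can consume it. A minimal sketch, assuming PyTorch and torchvision; the model and input shape are placeholders:

```python
import torch
import torchvision.models as models

# Train (or load) a model on one hardware platform...
model = models.resnet18(weights=None)
model.eval()

# ...then export it to ONNX so a different inference backend
# (CPU, another vendor's accelerator, etc.) can consume it.
dummy_input = torch.randn(1, 3, 224, 224)  # placeholder input shape
torch.onnx.export(model, dummy_input, "resnet18.onnx", opset_version=17)
```

Even after export, teams typically still need to validate accuracy and tune performance on the target hardware, which is precisely the effort most teams would rather avoid.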

Hardware implications of recent developments in AI

Models continue to get larger 

Since OpenAI’s 2018 report, we’ve seen increasingly large models, particularly for natural language processing (NLP) applications. Specialized accelerators are important for fostering innovation in AI, but they aren’t sufficient: model sizes and FLOPS have grown faster than memory and bandwidth. The rise of large models reinforces the need for simple, flexible distributed computing frameworks. As Ion Stoica of UC Berkeley noted a few years ago:

    “… to realize this promise we need to overcome the huge challenges posed by the rapidly growing gap between the demands of these applications and our hardware capabilities. To bridge this gap, we see no alternative but to distribute these applications.”
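Frameworks such as Ray, which came out of the UC Berkeley lab Stoica co-founded, aim to make that distribution straightforward. A minimal sketch of fanning work out across a cluster (the task body is a placeholder):

```python
import ray

ray.init()  # connect to (or start) a local Ray cluster

@ray.remote
def process_shard(shard_id):
    # Placeholder task: in practice, this could be a slice of data
    # preprocessing or one piece of a model evaluation job.
    return shard_id ** 2

# Fan the work out across whatever nodes the cluster provides.
futures = [process_shard.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]
```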

There is a lot of excitement surrounding Generative AI and Foundation Models

Generative AI has the ability to create entirely new content, rather than simply analyzing or responding to pre-existing data. These models can generate text, images, blog posts, program code, poetry, artwork, and more. Generative AI opens the door to many new businesses and applications that require content. Jasper, for instance, raised a large round of funding at the end of 2022 after remarkable customer growth.

Generative AI companies rely on foundation models: models trained on broad data and then adapted for downstream tasks and applications. For instance, companies focused on text applications may use an external model or construct their own large language model (a large model that takes text as input and produces text as output).

Some companies may decide to build upon existing foundation models and tailor them to their specific needs and use cases. Others may opt to create their own, driven by a desire for greater control, data privacy and compliance requirements, and the ability to quickly implement required features. This contrasts with external foundation models, which often offer only API access and may raise issues around data transmission and sharing, as well as the speed of adding necessary features.
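For teams that adapt an existing open model rather than train one from scratch, the workflow typically amounts to loading pretrained weights and continuing training on domain data. A minimal sketch using the Hugging Face transformers library; the checkpoint name and training text are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an existing (open) foundation model as the starting point.
model_name = "gpt2"  # placeholder; any open causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Downstream adaptation: continue training on in-domain text.
inputs = tokenizer("Example in-domain text...", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # one illustrative fine-tuning step
```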

Generative AI firms that decide to construct their own foundation models will be motivated by economic efficiency to turn to AI hardware accelerator firms that offer rack-scale AI systems, combining hardware and software to lower total cost of ownership.

Renewed focus on data tasks

It is widely recognized among machine learning engineers and other data professionals that prioritizing data is more effective than focusing solely on modeling. This is supported by years of surveys indicating that data teams spend the majority of their time on data acquisition, cleaning, and augmentation. In 2021, Andrew Ng coined the term data-centric AI, meaning techniques and tools for improving the accuracy of ML models through the cleaning, augmentation, and enhancement of datasets, and it became a rallying cry for practitioners and researchers.

The only valid assumption today is that hardware used for training also garners most inference workloads.

To implement a data-centric approach, teams must thoroughly prepare their data prior to training. This process, referred to as preprocessing, involves tasks such as cleaning, deduplication, denoising and anomaly detection, and visualization, all aimed at optimizing the data for training. While preprocessing is a vital part of the overall training process, its computational needs can differ from those of training itself: preprocessing is often CPU- and I/O-bound, while training is accelerator-bound. To optimize end-to-end ML pipelines, teams must consider both data preprocessing and training speed in order to maximize resource efficiency. As trained models become more prevalent and are fine-tuned by more practitioners, the significance of the preprocessing stage only grows.
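A minimal sketch of such a preprocessing stage, using pandas; the file names, column names, and thresholds are purely illustrative:

```python
import pandas as pd

# Hypothetical raw dataset with "text" and "label" columns.
df = pd.read_csv("raw_training_data.csv")

# Cleaning: drop records with missing labels or features.
df = df.dropna(subset=["text", "label"])

# Deduplication: exact duplicates skew a model toward repeated examples.
df = df.drop_duplicates(subset=["text"])

# Simple denoising/anomaly filter: discard implausibly short records.
df = df[df["text"].str.len() > 10]

df.to_csv("clean_training_data.csv", index=False)
```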

The emergence of decentralized custom models

As stated earlier, some teams choose to develop their own custom foundation models for reasons such as increased control, data privacy and compliance, and the ability to quickly add necessary features. 

    “If there’s only API access, that’s not enough. It depends on how technical a company is, how much it wants to invest, how much data it has, and whether it is comfortable shipping data to an API or to someone else. I don’t think there will be just one GPT-like model that rules them all. It will come down to the dynamics of how organizations are structured, and considerations like trust, cost, and other things.”
    Stanford’s Percy Liang on the likely rise of decentralized custom models

Another class of applications also requires custom models. While the trend toward large model training continues to dominate the conversation around artificial intelligence, an increasing number of businesses are discovering the need to train and deploy a greater volume of (smaller) machine learning models, sometimes numbering in the hundreds or thousands. Recent Anyscale blog posts (here and here) listed common scenarios, including the need for custom models per geographical zone, per sensor/device, or per customer/product. Unlike a centralized model that serves up personalized recommendations or outputs, these models are distinct and trained separately (preferably in parallel).

Compared to large centralized models, these decentralized custom models call for a different set of hardware specifications. When training many small models, each individual model needs fewer resources, and because the models are trained independently (and ideally in parallel), the training jobs do not communicate with one another. While smaller models may consume more total computing resources because of their sheer quantity, they require far less scale-out capability.
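Continuing the earlier Ray example, training many independent per-segment models in parallel might look like the following sketch; the segments and their synthetic data are placeholders:

```python
import numpy as np
import ray
from sklearn.linear_model import LogisticRegression

ray.init()

@ray.remote
def train_for_segment(segment_id: int):
    # Each segment (geographic zone, device, or customer) gets its own
    # small model, trained independently of all the others. Synthetic
    # data stands in for the segment's real dataset.
    rng = np.random.default_rng(segment_id)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
    return segment_id, LogisticRegression().fit(X, y)

# Hundreds of small, non-communicating training jobs run in parallel.
futures = [train_for_segment.remote(i) for i in range(200)]
models = dict(ray.get(futures))
print(len(models), "models trained")
```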

Closing Thoughts

Media attention in 2022 on semiconductor shortages and the geopolitical competition for AI leadership made hardware a major concern for many data and AI teams. While these issues persist, AI researchers continue to invest resources in areas including multimodal models, graph neural networks, robotics, and more.

Figure 3: Key topics found in 2022 arXiv.org papers in ML and AI; graphic is a variation of LDAvis, a visualization used to display topics.

A significant amount of attention has been paid recently to the emergence of advanced supercomputing platforms for AI. NVIDIA and Microsoft announced a partnership to build a cloud-based AI supercomputer that pairs NVIDIA GPUs with its Quantum-2 InfiniBand networking. Tesla revealed plans for its Dojo supercomputer, which uses proprietary chips and aims to offer access to enterprise clients in the near future. Cerebras unveiled Andromeda, an AI supercomputer with 13.5 million cores that can deliver over 1 exaflop of AI compute. And Jasper, which uses large language models and boasts a customer base of over 85,000, announced a partnership with Cerebras to enhance performance and optimize its next set of models.

The hope is that a new open-source software stack for the training of large models will mature.

AI is also increasingly being used to design hardware. As AI workloads grow, chip designers face mounting pressure to innovate and improve the performance of computing platforms. AI-powered design tools can automate time-consuming tasks, freeing designers to focus on more creative work; this both increases efficiency and drives further innovation in the field.

The training of large models has long relied on Nvidia GPUs and Google TPUs. New hardware accelerators face a significant hurdle: the companies behind them do not have the resources to develop a software stack comparable to CUDA or XLA. An alternative is an open-source software stack that extends down to the accelerator instruction set. One notable advancement in this area is the emergence of a stack centered on PyTorch 2.0 and Triton, which aims to make it easier for vendors to implement non-Nvidia backends. The hope is that this open-source software stack will mature and eventually provide a viable alternative to CUDA for the training of large models.
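The entry point to that stack is torch.compile, introduced in PyTorch 2.0, which captures a model and hands code generation to a backend; on GPUs the default TorchInductor backend emits Triton kernels. A minimal sketch:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# torch.compile captures the model and lowers it through TorchInductor,
# which emits Triton kernels on GPUs; alternative backends can target
# non-Nvidia hardware through the same interface.
compiled = torch.compile(model)

x = torch.randn(32, 128)
out = compiled(x)  # first call triggers compilation; later calls are fast
```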

In closing, this post examined specialized hardware for servers. It is important to note, however, that edge computing is gaining traction as a transformative trend. While traditional architectures rely on centralized, high-capacity computing and storage resources, edge computing processes data at the periphery of a network rather than sending it to a central location, which can lower latency and improve security in many situations. We’ll address hardware trends in edge computing in a future post.


Related content: Other posts by Assaf Araki and Ben Lorica.


Assaf Araki is an investment director at Intel Capital. His contributions to this post are his personal opinion and do not represent the opinion of the Intel Corporation. Intel Capital is an investor in Anyscale and SambaNova, and Intel acquired Habana. #IamIntel

Ben Lorica helps organize the Data+AI Summit and the Ray Summit, is co-chair of the NLP Summit, and principal at Gradient Flow.  He is an advisor to Databricks, Anyscale, and other startups.


If you enjoyed this post, please support our work by encouraging your friends and colleagues to subscribe to our newsletter.



[1] One notable exception is performance on large-scale training workloads: TPU v4 outperformed Nvidia’s submissions in certain categories.
