SMI's Enterprise PCIe Gen5 SSD Development Platform from Core to Edge for the AI Era
Substantial amounts of data for artificial intelligence (AI) workloads are being collected from various sources at the edge, including Internet of Things (IoT) devices, consumer smartphones, and autonomous vehicles. This data is relayed to data centers, which must continuously advance their processing, memory, and storage capacities to meet these increasing demands efficiently.
Hyperscale data centers and high-performance computing systems must store large volumes of data efficiently and access it quickly for AI data pipelines. Key factors for AI data storage include performance, power consumption, and total cost of ownership (TCO), not just capacity.
Hard drives can hold large volumes of edge-collected data at rest for AI training, inference, and validation, but SSDs offer benefits across the complete pipeline: they store data and deliver it quickly and efficiently to large numbers of GPUs and other processors and accelerators. SSDs also provide security features that protect data integrity, along with data placement and performance-shaping capabilities that both accelerate data transformation and improve energy efficiency.
Where AI Data Is, How It Is Being Used, and Trends Affecting Storage
The AI data pipeline relies heavily on substantial quantities of data, which are processed in data centers for ingestion, preparation, training, and inference. This data can take various forms, such as text and video, resulting in extensive data sets that must be efficiently stored to facilitate manipulation and transformation.
Machine learning algorithms look for interdependencies and patterns within data sets and apply those learnings to any new data that is ingested. More data, and higher-quality data, is how these algorithms are refined and improved over time.
This data isn't uniform, however, which means it must be prepared for training after it is ingested. The preparation and the subsequent training and inference mean data isn't sitting idle. It is being read, moved around, and transformed.
AI Data Pipeline and Its Demand on Storage
The AI data pipeline puts new pressures on storage. It is not enough to store copious amounts of data that can be accessed quickly - high capacity is not the only requirement. Depending on the stage of the AI data pipeline, storage must manage both high-performance sequential workloads and mixed workloads.
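The difference between these access patterns is easy to demonstrate. The minimal Python sketch below reads the same data sequentially and then at random offsets; the file name and sizes are assumptions, and a buffered read through the page cache will flatter both numbers (real benchmarks use direct-I/O tools such as fio), but the two access patterns themselves are representative of what the drive sees at different pipeline stages.

```python
import os
import random
import time

BLOCK = 4096                 # 4 KiB per I/O, a common random-workload size
PATH = "testfile.bin"        # hypothetical pre-filled file on the drive under test
COUNT = 10_000               # reads per access pattern

def throughput(offsets):
    """Read BLOCK bytes at each offset; return MB/s."""
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return len(offsets) * BLOCK / (time.perf_counter() - start) / 1e6

blocks = os.path.getsize(PATH) // BLOCK
n = min(COUNT, blocks)
sequential = [i * BLOCK for i in range(n)]                        # ordered offsets
scattered = [random.randrange(blocks) * BLOCK for _ in range(n)]  # random offsets

print(f"sequential reads: {throughput(sequential):8.1f} MB/s")
print(f"random reads:     {throughput(scattered):8.1f} MB/s")
```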
A typical AI data pipeline for a large language model (LLM) requires storage that maximizes data efficiency and minimizes total system power while reducing training completion times.
The first stage of the pipeline ingests data at high volume and velocity, requiring high-throughput sequential writes with append operations. It retrieves raw data from various external sources, including IoT devices, web scraping, databases, and autonomous vehicles.
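From the drive's perspective, this stage looks like a long sequential append stream. The following minimal sketch, with hypothetical file and field names, shows the shape of such a writer: records are serialized, appended to the tail of a shard file, and synced before the batch is acknowledged.

```python
import json
import os

SHARD = "ingest-shard-000.jsonl"    # hypothetical append-only shard file

def ingest_batch(records, path=SHARD):
    """Append a batch of raw records; the SSD sees one sequential write stream."""
    with open(path, "ab") as f:     # append mode: writes always go to the tail
        for rec in records:
            f.write(json.dumps(rec).encode() + b"\n")
        f.flush()
        os.fsync(f.fileno())        # persist the batch before acknowledging it

ingest_batch([{"source": "sensor-17", "payload": "raw telemetry ..."}])
```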
This unstructured data is stored in its raw form; during preparation it is then modified, which involves many different, complicated workloads. These workloads influence how sequential read and write performance must be handled.
The data preparation stage of an AI data pipeline is read-intensive and sequential, with low-latency sequential writes. Data must be cleaned because the collection methods are unstructured - corrupt records, duplicates, and simple "dummy data" that is not helpful for machine learning must be sifted out. Because unstructured data is not categorized, formatted, or stored in the structured manner required for processing, it must be preprocessed - classified automatically and staged for use - before processing can occur.
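A minimal sketch of that cleaning pass is shown below. The record layout and the "payload" field are assumptions, and production pipelines typically do this with distributed frameworks, but the storage pattern is the same: a raw shard is streamed sequentially while corrupt, empty, and duplicate records are dropped.

```python
import hashlib
import json

def clean(raw_lines):
    """Stream raw JSONL records, dropping corrupt, empty, and duplicate entries."""
    seen = set()
    for line in raw_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue                                  # corrupt record: drop
        if not rec.get("payload"):
            continue                                  # empty/dummy record: drop
        digest = hashlib.sha256(line.strip().encode()).digest()
        if digest in seen:
            continue                                  # exact duplicate: drop
        seen.add(digest)
        yield rec

with open("ingest-shard-000.jsonl", "r") as f:        # hypothetical shard name
    prepared = list(clean(f))
```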
The actual training stage of an AI data pipeline involves churning data through many different processes. Random read performance is critical - there is a high degree of reading and writing throughout the training stages, which requires low latency and extremely high IOPS. The training process usually has checkpoints and restore points in the event something goes wrong.
Checkpointing is more than writing data; it is more nuanced and complex than the autosave feature in a Microsoft Word document. And if a problem requires the training process to go back to a checkpoint, the storage media must support burst sequential reads to enable recovery - in effect, a system restore of a large amount of data.
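The sketch below shows the basic pattern under some simplifying assumptions: pickle stands in for a framework serializer such as torch.save, and the directory name is hypothetical. A checkpoint is one large sequential write burst, made durable and renamed atomically so a crash never leaves a half-written file behind; recovery is one large sequential read of the newest complete checkpoint.

```python
import os
import pickle

CKPT_DIR = "checkpoints"    # hypothetical checkpoint directory

def save_checkpoint(step, state):
    """One large sequential write burst, fsync'd and renamed atomically."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    tmp = os.path.join(CKPT_DIR, f"step-{step:08d}.tmp")   # zero-padded for sorting
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
        f.flush()
        os.fsync(f.fileno())                     # durable before it counts
    os.replace(tmp, tmp.replace(".tmp", ".ckpt"))  # atomic: all or nothing

def restore_latest():
    """Recovery: a burst sequential read of the newest complete checkpoint."""
    ckpts = sorted(f for f in os.listdir(CKPT_DIR) if f.endswith(".ckpt"))
    if not ckpts:
        return None                              # no checkpoint: start from scratch
    with open(os.path.join(CKPT_DIR, ckpts[-1]), "rb") as f:
        return pickle.load(f)
```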
Inference requires both sequential reads and writes at low latency. Recommendation models for online shopping use inference to show a customer what customers like them have looked at or bought, which involves a great deal of random read and write performance from the storage media.
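Those random reads come from lookups like the one sketched below, where per-item features sit at scattered offsets in a flat file. The file name and fixed record size are assumptions (real systems use a key-value or vector store), but each lookup is a small read at an arbitrary offset - exactly the random-read workload described above.

```python
REC_SIZE = 256                      # hypothetical fixed-size feature record
STORE = "item-features.bin"         # hypothetical feature store file

def fetch_features(item_ids, path=STORE):
    """Fetch one record per item; offsets are scattered, not sequential."""
    rows = []
    with open(path, "rb") as f:
        for item_id in item_ids:
            f.seek(item_id * REC_SIZE)   # jump to an arbitrary offset
            rows.append(f.read(REC_SIZE))
    return rows

# e.g. features for the items a similar customer viewed
similar_items = fetch_features([48213, 912, 77105])
```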
At the end of the pipeline there is data archiving, which requires high-capacity storage and sequential writing, but that data may get read again and even re-ingested.
All these stages together constitute an AI workload, and it is not a one-off process - it is a never-ending cycle of data being collected and transformed. This continuous loop means the storage environment must be tailored at any given point to meet performance requirements while achieving a better return on investment (ROI).
- Transforming AI Data Pipelines with Advanced SSD Technology (Part 2)
- Transforming AI Data Pipelines with Advanced SSD Technology (Part 3)