This is Part Two of a three-part series: See Part One here.
As the need for data storage grows, NVMe solutions must adapt to support the needs of growing neural networks when GPU nodes cannot accommodate enough local SSDs. This is a common problem in AI/ML that stems from attempts to fit large datasets into a single GPU.
While solutions like building GPU clusters and placing local SSDs inside the GPU nodes to reduce latency may seem like the answer, obstacles around storage capacity, data access, and data protection end up limiting performance.
While some GPU-based servers can perform entire processing operations on their own, certain tasks require multiple GPUs. Issues arise in these clustered GPU nodes when data is stored locally, because the nodes cannot share their dataset. Giving every node access to the same dataset then means copying the data to each individual node, which consumes valuable storage space, yet it is necessary when clustered GPU nodes all need the same dataset for machine learning training.
Performance bottlenecks also occur when installations use local SSDs as a cache in an effort to bypass slow storage and expedite access to the working dataset. The sheer amount of data movement delays cached data becoming available on the SSDs, and as datasets grow, local SSD caching becomes ineffective at feeding the GPU training models at the required speed.
Shared NVMe storage can solve the performance challenge for GPU clusters by giving all nodes in the cluster shared read/write access to data at the performance of local SSDs. The need to cache or replicate datasets on every node in the GPU cluster is eliminated, improving the overall storage efficiency of the cluster. With some solutions offering support for up to 1PB of RAID-protected, shared NVMe data, the GPU cluster can tackle massive deep learning training for improved results. For clustered applications, this type of solution is ideal for global file systems such as IBM Spectrum Scale, Lustre, Ceph, and others.
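To make the distinction concrete, here is a minimal sketch, assuming a TensorFlow input pipeline: instead of copying the dataset to a local path on every GPU node, each node builds its pipeline from a single shared NVMe-backed mount. The mount points and file pattern are hypothetical placeholders.

```python
# Minimal sketch: every GPU node builds its input pipeline from the same
# shared NVMe-backed mount instead of a per-node local copy.
# The mount points and file pattern below are hypothetical placeholders.
import tensorflow as tf

# Per-node local copy (what the cluster is trying to avoid):
#   data_root = "/local_nvme/imagenet"   # duplicated on every GPU node
# Single shared namespace exported over NVMe-oF / a parallel file system:
data_root = "/mnt/shared_nvme/imagenet"  # same path on every node

files = tf.data.Dataset.list_files(f"{data_root}/train-*.tfrecord", shuffle=True)
dataset = (
    files.interleave(tf.data.TFRecordDataset,
                     num_parallel_calls=tf.data.AUTOTUNE)
         .batch(64)
         .prefetch(tf.data.AUTOTUNE)  # keep the GPU fed while records stream in
)
```

Because every node reads the same path, adding nodes to the cluster does not multiply the storage footprint of the dataset.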
Use Case: Deep Learning Datasets
One vendor provides the hardware infrastructure its customers use to test a variety of applications. With simple connectivity via Ethernet (or InfiniBand), shared NVMe storage provides more capacity for deep learning datasets, allowing the vendor to expand the use cases it offers to its customers.
Moving to Shared NVMe-oF Storage
Having discussed the performance of NVMe inside GPU nodes, let's explore the performance impact of moving to shared NVMe-oF storage. For this discussion, we will use an example where performance testing focuses on single-node performance with shared NVMe storage relative to the local SSD inside the GPU node.
Reasonable benchmark parameters and test objectives could be (a measurement sketch for items 2 and 3 follows the list):
1. RDMA Performance: Test whether RDMA-based (remote direct memory access) connectivity at the core of the storage architecture could deliver low latency and high data throughput.
2. Network Performance: How large quantities of data affect the network, and whether the network becomes a bottleneck during data transfers.
3. CPU Consumption: How much CPU power is consumed during large data transfers over the RDMA-enabled NICs.
4. In general, whether RDMA technology could be a key component of an AI/ML computing cluster.
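For objectives 2 and 3, a minimal measurement sketch (assuming Python with the psutil package, and hypothetical mount paths) could stream a dataset from either the local SSD or the NVMe-oF share and report read throughput alongside host CPU utilization:

```python
# Minimal sketch for objectives 2 and 3: stream a dataset from a mount point
# (local SSD or the NVMe-oF share) and report read throughput and host CPU use.
# The paths are hypothetical placeholders; psutil is an assumed dependency.
import os
import time
import psutil

def measure_read(path, block_size=1 << 20):
    """Sequentially read every file under `path`, returning GB/s and avg CPU %."""
    psutil.cpu_percent(interval=None)          # prime the CPU counter
    total_bytes, start = 0, time.time()
    for root, _, names in os.walk(path):
        for name in names:
            with open(os.path.join(root, name), "rb") as f:
                while chunk := f.read(block_size):
                    total_bytes += len(chunk)
    elapsed = time.time() - start
    return total_bytes / elapsed / 1e9, psutil.cpu_percent(interval=None)

gbps, cpu = measure_read("/mnt/shared_nvme/imagenet")   # vs. "/local_nvme/imagenet"
print(f"throughput: {gbps:.2f} GB/s, avg CPU: {cpu:.1f}%")
```

Running the same function against the local path and the shared path gives a rough view of whether the network link or the host CPU becomes the limiting factor.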
I have in fact been privy to similar benchmarks. For side-by-side testing, a TensorFlow benchmark with two different data models was utilized: ResNet-50, a 50-layer residual neural network, as well as VGG-19, a 19-layer convolutional neural network that was trained on more than a million images from the ImageNet database. Both models are read-intensive, as the neural network ingests massive amounts of data during both the training and processing phases of the benchmark. A single GPU node was used for all testing to maintain a common compute platform across all of the test runs. The storage appliance was connected to the node via the NVMe-oF protocol over 50GbE/100GbE ports for the shared NVMe storage testing. For the final results, all of the test runs used a common configuration of training batch size and quantity. During initial testing, different batch sizes were tested (32, 64, 128), but ultimately the testing was performed using the recommended settings.
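The exact benchmark harness is not reproduced here, but a simplified Keras-based sketch of a side-by-side run might look like the following. It assumes an ImageNet-style directory of class subfolders, and the paths, image size, and step count are illustrative placeholders.

```python
# Simplified stand-in for the TensorFlow benchmark runs: train ResNet-50 or
# VGG-19 for a few steps against a dataset directory that lives either on the
# local SSD or on the shared NVMe-oF mount. Paths and counts are illustrative.
import time
import tensorflow as tf

def run(model_name="resnet50", data_dir="/mnt/shared_nvme/imagenet",
        batch_size=64, steps=200):
    ds = tf.keras.utils.image_dataset_from_directory(
        data_dir, image_size=(224, 224), batch_size=batch_size
    ).prefetch(tf.data.AUTOTUNE)

    builder = {"resnet50": tf.keras.applications.ResNet50,
               "vgg19": tf.keras.applications.VGG19}[model_name]
    model = builder(weights=None, classes=1000)
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

    start = time.time()
    model.fit(ds.take(steps), epochs=1, verbose=0)
    images_per_sec = steps * batch_size / (time.time() - start)
    print(f"{model_name} on {data_dir}: {images_per_sec:.0f} images/sec")

# e.g. run(data_dir="/local_nvme/imagenet") vs. run(data_dir="/mnt/shared_nvme/imagenet")
```

Running the same function once against the local SSD path and once against the shared NVMe-oF mount yields the images-per-second comparison discussed below.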
Looking at the Results
The appliance exceeded the performance of the local NVMe SSD inside the GPU node by a couple of percentage points in both image throughput and overall training time. This highlights one of the performance advantages of shared NVMe storage: spreading volumes across all drives in the array gains the throughput of multiple SSDs, which compensates for any latency impact of moving to external storage. In other words, the improved image throughput means that more images can be processed in a given amount of time with shared NVMe storage than with local SSDs. While the difference is only a few percentage points, the advantage will stand out as more GPU nodes are added to the compute cluster. Training time with shared NVMe storage was also faster than with local SSDs, allowing customers to leverage datasets of 100TB or more to increase speed and permit deeper learning for improved results.
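A back-of-envelope sketch illustrates the reasoning: a volume striped across many SSDs can offer far more aggregate read bandwidth than a single local drive, so even after the NVMe-oF network hop there is ample headroom to keep a GPU node fed. All figures below are illustrative assumptions, not measured results.

```python
# Back-of-envelope sketch of why striping offsets the network hop.
# All figures are illustrative assumptions, not measured results.
local_ssd_gbps = 3.0          # one NVMe SSD inside the GPU node
drives_in_array = 24          # shared appliance stripes a volume across many SSDs
per_drive_share_gbps = 2.5    # conservative per-drive contribution
nic_gbps = 100 / 8            # 100GbE link, in GB/s (~12.5 GB/s)

array_gbps = drives_in_array * per_drive_share_gbps
deliverable = min(array_gbps, nic_gbps)   # the network caps what one node can pull

print(f"local SSD: {local_ssd_gbps} GB/s, shared volume (capped by NIC): {deliverable} GB/s")
# Even after the NVMe-oF hop, one node can draw more read bandwidth from the
# striped shared volume than from its single local SSD.
```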