GPU and CPU Comparison
Nvidia CUDA Architecture
CUDA Unified Memory and Address Space (UVA)
GPU Memory Hierarchy
Streaming Multiprocessors (SM)
GPU Performance Considerations
GPU and CPU Monitoring Differences
Nvidia GPU Metrics
Nvidia GPU Profiling
CUDA Environment Setup
GPU Sample Program
GPU Benchmark
Tensor Cores - Nvidia New Turing Architecture (Volta/T4 GPUs)
General Purpose GPU Computing
The general purpose GPU computing paradigm started when Nvidia introduced CUDA (Compute Unified Device Architecture) compliant GPUs, which enables them to be programmed via high-level programming languages like C and C++. In GPU-accelerated computing, a program utilizes GPUs in conjunction with CPUs to accelerate compute-heavy tasks, common in the fields of AI, Machine Learning, Deep Learning, analytics, and many other engineering applications. Although application code runs on a CPU, it can offload compute-intensive portions of the program, called kernels, to the GPU. The net result is an overall performance boost. No knowledge of graphics programming (OpenGL or DirectX) is required when coding in the CUDA language; one may only have to learn a modestly extended version of C. Graphics APIs such as DirectX and OpenGL are programmed through high-level shading languages such as Cg. CUDA, on the other hand, helps write general purpose extensions in C that exploit the massively parallel processing of the GPU. This type of GPU programming is a clear separation from the early GPU programming model. CUDA offers a flexible way of programming the GPU that allows new algorithms to be developed and deployed quickly and efficiently.
A GPU with high memory bandwidth, thread parallelism (parallelism can help hide memory latencies), and an abundance of fast registers and L1 cache can improve matrix multiplication algorithms by storing larger datasets closer to the processing engine, which makes the GPU an ideal platform for Machine and Deep Learning. If a problem can be expressed as a data-parallel computation, then each data element can be processed in parallel. No sophisticated data flow control or large caches are required, as found in a CPU, since the same function is applied to each data element in parallel and memory access latency can be hidden by the high arithmetic intensity of running the same calculation across all elements.
Many applications that process large datasets can use a data-parallel programming model to speed up computations. Data-parallel processing maps data elements to parallel processing threads and offers high arithmetic intensity, where arithmetic intensity is simply the ratio of arithmetic operations to memory operations. For example (a minimal kernel sketch follows this list):
- In 3D rendering, large sets of pixels and vertices are mapped to parallel threads.
- Image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, vision and pattern recognition can map image blocks and pixels to parallel processing threads.
- Neural networks (DNN, CNN, RNN) are built using large numbers of identical interconnected neurons distributed across multi-layer networks; they are highly parallel by nature. This sort of parallelism maps naturally to GPUs, which offer a remarkable speedup over CPU-only training. Neural networks rely heavily on matrix math operations and require high floating-point performance and memory bandwidth. GPUs have thousands of processing cores optimized for matrix math operations, providing tens to hundreds of TFLOPS of performance.
- Internal model training benchmarks showed that training time (FFD, LSTM, CNN) can be reduced several fold with a GPU when compared to a CPU.
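To make the mapping concrete, here is a minimal sketch (not from the original text) of an image-processing kernel in which one GPU thread owns one pixel; the kernel name brighten and the brightness operation are illustrative assumptions:

__global__ void brighten(unsigned char *img, int width, int height, int delta)
{
    // Each thread computes the coordinates of the single pixel it owns.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;
        int v = img[idx] + delta;              // the same arithmetic is applied to every element
        img[idx] = v > 255 ? 255 : (unsigned char)v;
    }
}

// Launch with one thread per pixel, e.g. 16x16 threads per block:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_img, width, height, 40);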
CPU and GPU Comparison
Nvidia CUDA Architecture
CUDA is a combination of hardware and software architecture that enables Nvidia GPUs to execute programs written in C, C++, Fortran, OpenCL, DirectCompute and other languages. Nvidia GPUs are built around the CUDA architecture.
CUDA uses a parallel programming model that breaks a compute-intensive task into hundreds or even thousands of parallel tasks that can run concurrently across thousands of cores in the GPU. Nvidia GPUs are optimized for throughput, not latency. CUDA treats the GPU as a coprocessor: a CUDA program requires copying data from CPU memory to GPU memory and vice versa, and it strives to minimize data movement between CPU and GPU to increase GPU utilization. CUDA programs, with the help of Nvidia drivers, transfer data to/from the GPU via DMA (Direct Memory Access) transfers. It is always the CPU that initiates data transfers in both directions; the GPU writes results into a DMA buffer (part of system RAM) for the CPU to pick up. A GPU program performs the following high-level operations:
- CPU allocates memory on the GPU (a.k.a. the device) by calling cudaMalloc()
- CPU copies input data from CPU memory to device memory by calling cudaMemcpy()
- CPU launches a function, called a kernel, on the device to process the input data
- CPU copies results back from device memory to CPU memory by calling cudaMemcpy()
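A minimal sketch of these four steps, assuming a simple element-wise vector-add kernel; the kernel name vec_add, the sizes and the omission of error checking are illustrative choices, not taken from the original text:

#include <cuda_runtime.h>
#include <stdio.h>

// kernel or function that will run on GPU: one thread adds one element
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);                                   // step 1: allocate device memory
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);       // step 2: copy input data to the device
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);       // step 3: launch the kernel

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);       // step 4: copy results back to the host
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}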
CUDA Unified Memory and Address Space (UVA)
CPU and GPU memory are physically separated by the PCI-Express bus. Data sharing between CPU and GPU requires the program to explicitly allocate and copy data between the two physical memory locations. The CUDA Unified Memory model allows sharing of CPU and GPU memory using a single pointer: unified memory looks like CPU memory to code running on the CPU, and GPU memory to code running on the GPU. The system transparently migrates data allocated in Unified Memory between the host (CPU) and device (GPU) when it is accessed. To allocate space in the unified memory pool, the program calls the cudaMallocManaged() routine. Unified Virtual Addressing (UVA) enables "Zero-Copy" memory, which is pinned (page-locked) host memory accessible by code running on the device directly over the PCI-Express bus, without requiring memcpy(). This feature allows an application to use a single address space no matter where the code is running: host or device. Memory allocated via the CUDA API, whether it resides on the host or the device, is mapped to the same virtual address space. With UVA, the location of memory (host or device) can be determined by calling cudaPointerGetAttributes() and passing it the pointer. An application may also query whether UVA is used for a particular device by checking that the unifiedAddressing device property is set to 1. UVA allows cudaMemcpy() to be used without specifying where exactly the input and output parameters reside.
NOTE: Unified Memory support eliminates the "deep copy", an expensive operation. For example, to use a struct of data elements on the device one previously had to copy all the data members of the struct, additionally copy all the data that the struct points to, and then update all the pointers in that copy of the struct, which made passing data elements to a kernel function complex. Unified Memory simplifies this by eliminating the extra step, as the device operates on the same pointers as the host code.
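A minimal sketch of Unified Memory in use, assuming a hypothetical struct with a pointer member; note that there is no cudaMemcpy() and no per-member deep copy:

#include <cuda_runtime.h>
#include <stdio.h>

struct Samples { int n; float *values; };          // struct containing a pointer member

__global__ void scale(Samples *s, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < s->n) s->values[i] *= factor;          // device follows the same pointers as the host
}

int main(void)
{
    Samples *s;
    cudaMallocManaged(&s, sizeof(Samples));                   // the struct itself lives in unified memory
    s->n = 1024;
    cudaMallocManaged(&s->values, s->n * sizeof(float));      // so does the array it points to
    for (int i = 0; i < s->n; i++) s->values[i] = 1.0f;

    scale<<<(s->n + 255) / 256, 256>>>(s, 2.0f);   // pass the very same pointer to the kernel
    cudaDeviceSynchronize();                        // wait before the host touches managed data again

    printf("values[0] = %f\n", s->values[0]);      // prints 2.0

    cudaFree(s->values);
    cudaFree(s);
    return 0;
}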
Streaming Multiprocessors (SM)
A multiprocessor (SMX) is designed to execute hundreds of threads concurrently. When a CUDA program on the host invokes a kernel grid, the blocks of the grid are enumerated and distributed to the multiprocessors (SMX) in the GPU with available execution capacity. Multiple Thread Blocks can execute concurrently on one multiprocessor. As Thread Blocks terminate, new blocks are launched on the vacated multiprocessors.
A GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors.
WARPS
The SMX creates, manages, schedules and executes threads in groups of 32 parallel threads called warps. When a multiprocessor is given one or more Thread Blocks to execute, it partitions them into warps. How warps are used by the SMX is described below:
- Each SMX has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache (shared memory) that is partitioned among the Thread Blocks.
- Individual threads composing a warp start together at the same program address, but each thread in a warp has its own instruction address counter and register state, and is therefore free to branch and execute independently.
- A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path; when all paths complete, the threads converge back to the same execution path (see the divergence sketch after this list). Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
- Threads of a warp that are on the warp's current execution path are called the active threads, whereas threads not on the current path are inactive (disabled). Threads can be inactive because:
- They have exited earlier than other threads of their warp.
- They are on a different branch path than the branch path currently executed by the warp.
- They are the last threads of a block whose number of threads is not a multiple of the warp size.
- If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device, and which thread performs the final write is undefined.
- The execution context (program counter, registers, etc.) for each warp processed by a SMX is maintained on-chip during the entire lifetime of the warp. Therefore switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction and issues the instruction to those threads.
- Thread block is partitioned into warps. Each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.
- The number of blocks and warps that can be processed together on the SMX for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the SMX.
- The number of warps per SMX and the amount of registers and shared memory available on the multiprocessor depend on the compute capability of the GPU and differ by GPU type. A kernel fails to launch if there are not enough registers or shared memory available per SMX to process at least one Thread Block.
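A minimal sketch of intra-warp branch divergence, assuming a hypothetical kernel in which even and odd lanes take different branches; within each warp the two paths are serialized, while different warps remain unaffected:

__global__ void divergent(float *data, int n)
{
    // Threads 0-31 of the block form warp 0, threads 32-63 form warp 1, and so on.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0)
            data[i] *= 2.0f;   // even lanes run this path while odd lanes sit disabled
        else
            data[i] += 1.0f;   // then odd lanes run this path while even lanes sit disabled
    }
    // Both paths done: the 32 lanes of each warp reconverge and proceed together.
}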
GPU Memory Hierarchy
A thread can access data from several memory spaces on the GPU: its own registers and per-thread local memory, per-block shared memory visible to all threads of the same Thread Block, and global, constant and texture memory visible to all threads.
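A minimal sketch, with arbitrary names, showing the qualifiers that place data in these memory spaces:

__constant__ float coeff[16];                 // constant memory: read-only, cached, set from the host
                                              // with cudaMemcpyToSymbol(coeff, ...)

__global__ void spaces_demo(const float *in, float *out, int n)   // in/out point to global memory
{
    __shared__ float tile[256];               // shared memory: one copy per Thread Block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;         // v lives in a per-thread register

    tile[threadIdx.x] = v;                    // stage the value in the block's shared memory
    __syncthreads();                          // wait until every thread of the block has written

    if (i < n)
        out[i] = tile[threadIdx.x] * coeff[0];
}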
GPU Performance Considerations
Consider the following guidelines to improve GPU utilization and, in turn, reduce model training time.
- Make conscious design decisions to reduce host and device memory transfers, which can improve GPU utilization. Some Machine Learning algorithms or models can be executed completely on the GPU and do not require CPU computation. For example, all stages of tree construction (Decision Tree algorithms) can be efficiently completed on the GPU.
All stages of training running on GPU
- Consider using Gradient Boosting for training your model to achieve higher accuracy for regression and classification tasks. A popular library for applying Gradient Boosting is XGBoost (Extreme Gradient Boosting), which allows faster and scalable training in a multi-GPU setup. The XGBoost library makes it possible to run all phases of training on the GPU: gradient calculation, feature quantization, prediction, decision tree construction and evaluation. Some XGBoost library features:
- Training the model requires the whole dataset to fit in the combined memory of all GPU devices. Data compression techniques used by XGBoost reduce GPU memory usage and allow much bigger datasets to be trained in parallel.
- XGBoost uses symbol compression to store the quantized input matrix on the device. Matrix values are compressed down to log2(max_value) bits, where max_value is the maximum integer value of any quantized matrix element. The data is not modified once on the device and is read many times. The small number of bitwise operations computed on the GPU incurs no visible performance penalty. This bit compression method typically reduces GPU memory consumption by 4x or more over the standard floating point representation, thus allowing training on significantly larger datasets.
- XGBoost on the GPU works on a quantile representation of the input feature space. Having features in quantized form reduces the tree construction problem to one of gradient summation into histograms, speeding up execution time. Each feature is binned so that each bin holds roughly the same number of data points. This reduces the computational complexity of finding the best split from O(n_features x n_instances) to O(n_features x n_bins), with n_bins << n_instances.
- Internal benchmarks reported model training time reductions of up to 20x when using the XGBoost library.
- Scaling computation from one GPU to multiple GPUs in a single machine can enable much faster model training. The Horovod library makes it fast and easy (minimal code changes) to do distributed deep learning in TensorFlow. Newer Nvidia GPUs support NVLink for inter-GPU communication, which offers up to 10x higher throughput than the PCIe bus.
- Train the model with mixed precision. Nvidia's new Tensor Cores (Volta/Turing GPUs) offer hardware acceleration for mixed precision training. Lower precision than 32-bit floating point requires less memory and bandwidth, and math operations run faster in reduced precision. Up to 3x speedup is possible with mixed precision training on the Volta/Turing architecture.
- Simple and small algorithms can be used as building blocks for massively parallel algorithms. GPU primitives may be used to compose more complicated algorithms while retaining high performance, readability and reliability (see the reduction sketch after this list). Some examples of parallel primitives:
- Reduction (Harris)
- Parallel prefix sum (scan)
- Radix sort
- Segmented scan and reduce
- Interleaved sequences: multi-reduce
- Interleaved sequences: multi-scan
- GPUs are optimized for 32-bit floating point operations, but not for 64-bit double precision. Consider 32-bit parallel versus sequential summation: 32-bit parallel summation shows dramatically superior numerical stability, since the error of parallel summation grows proportionally to O(log n), compared to O(n) for sequential summation.
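A minimal sketch of the reduction primitive listed above: a tree-style parallel sum in shared memory (block size, kernel name and the single-pass structure are illustrative). Because values are combined pairwise, rounding error accumulates over O(log n) steps rather than n sequential additions, which is the stability argument made in the last bullet. Each block emits one partial sum; a second pass or a host-side sum would combine them:

#define BLOCK 256

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float sdata[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // load one element per thread
    __syncthreads();

    // Tree reduction: halve the number of active threads each step, O(log n) steps deep.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            sdata[threadIdx.x] += sdata[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = sdata[0];                // one partial sum per block
}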
Nvidia GPU Metrics
There are some basic metrics that can be fetched via Netflix Vector or Grafana, front ends to the open source PCP (Performance Co-Pilot) package, to find GPU utilization.
Nvidia GPU Profiling
One can profile functions (kernels) running on the GPU and isolate the cause of GPU bottlenecks and low utilization via the Nvidia tools nvprof (CLI) and Visual Profiler (GUI), which annotate events, code ranges and application resources. In addition, the Nvidia C API called NVTX can be integrated into the program to capture, visualize (via Visual Profiler) and trace CPU events, time ranges and the naming of CUDA resources.
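A minimal sketch of NVTX range annotation, assuming the program is linked against libnvToolsExt (-lnvToolsExt); the function and range names are arbitrary:

#include <nvToolsExt.h>

void train_step(void)                   // hypothetical CPU function to annotate
{
    nvtxRangePushA("train_step");       // open a named CPU time range
    /* ... work to be profiled ... */
    nvtxRangePop();                     // close the range; it appears on the Visual Profiler timeline
}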
Capture a particular metric: $ nvprof --metrics achieved_occupancy,executed_ipc -o nvprof.out <application_name>
For performance analysis of a specific kernel running on the GPU, type:
$ nvprof --kernel <kernel-name> --analysis-metrics -o analysis.prof <application_name>
It will produce an analysis.prof file that can be viewed using the Nvidia Visual Profiler.
Caution: the nvprof --metrics option may negatively affect the performance characteristics of functions running on the GPU, as it may cause all kernel executions to be serialized.
CUDA Environment Setup
Nvidia Validation Suite: http://docs.nvidia.com/deploy/nvvs-user-guide/index.html#nvidia-validation-suite-goals
$ nvvs -g
$ nvvs -c Amazon_EC2_p3.16xlarge.conf -d 5 -l debug.log
$ nvvs -c Tesla_K40c_quick.conf
Cuda repos for Ubuntu Xenial: http://developer.download.nvidia.com/compute/cuda/repos/
Nvidia Validation Suite consists of a series of plugins:
Deployment plugin: Verify compute environment is ready to run Cuda app and is able to load NVML library
GPU Bandwidth plugin: Measure bandwidth and latency to and from the GPUs and the host
SM Performance Plugin: Bring the GPU to a target performance level in gigaflops by doing large matrix multiplications
Nvidia GPU Feature Comparison and Benchmarks
Compared to a CPU, a GPU has much higher single/double precision floating point compute power and memory bandwidth.
- Nvidia Training and Inference Benchmarks: MLPerf, ResNet-50 using: TensorFlow, MXNet, PyTorch
Tensor Cores - Nvidia New Architecture (Volta/Turing GPUs)
Nvidia's Volta and Turing architectures introduced Tensor Cores, which are specialized execution units designed specifically for accelerating the tensor (matrix) operations that are the important compute functions used in Deep Learning training and inference. For graphics acceleration, Tensor Cores are used by a technique called Deep Learning Super Sampling (DLSS). DLSS leverages a DNN to extract multi-dimensional features of the rendered scene and intelligently combine details from multiple frames to construct a high quality final image, with fewer input samples and reduced algorithmic complexity when dealing with transparency and other complex scene elements.
The Turing SM provides independent floating point and integer data paths that run in parallel; in previous generations, these instructions would have blocked floating-point instructions from issuing. This feature accelerates deep learning inference applications that use a mix of computation and address calculation.
One of the big advantages of Deep Learning (DL) is that a model can be trained at high precision and implemented at lower precision without sacrificing accuracy. The T4 is priced lower than Volta GPUs. Even though the T4 is slower than Volta GPUs in DL training benchmarks (ResNet-50 image classification on CNNs, MLPerf), it is on par with and even better than Volta in DL-powered inference due to its enhanced DL capabilities.
Tensor Cores are purpose-built to accelerate multi-precision inference performance. The new Streaming Multiprocessor (SM) with Turing Tensor Cores provides FP16/FP32 mixed-precision matrix math. There are also new INT8/INT4 precision modes that support fast INT8 matrix operations to improve throughput with minimal loss in accuracy. The ability to do reduced-precision inference can significantly lower application latency while preserving model accuracy, as required for embedded applications.
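A minimal sketch of this FP16/FP32 mixed-precision matrix math as exposed to CUDA code through the WMMA API (requires compute capability 7.0 or higher, e.g. nvcc -arch=sm_70); the 16x16x16 tile shape and kernel name are illustrative:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile of A by a 16x16 FP16 tile of B,
// accumulating into a 16x16 FP32 tile of C on the Tensor Cores.
__global__ void wmma_tile(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start the FP32 accumulator at zero
    wmma::load_matrix_sync(a_frag, a, 16);           // 16 is the leading dimension of the tiles
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // FP16 multiply, FP32 accumulate
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
// Launch with a single warp per tile, e.g. wmma_tile<<<1, 32>>>(d_a, d_b, d_c);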
The Turing-based T4 offers a unified shared memory and L1 cache, which simplifies programming and boosts performance. It also supports address space isolation and independent thread scheduling, which enable finer-grained synchronization and cooperation.