Linux Performance in Cloud: 2019

General Purpose GPU Computing
GPU and CPU Comparison
Nvidia CUDA Architecture
CUDA Unified Memory and Address Space (UVA)
GPU Memory Hierarchy
Streaming Multiprocessors (SM)
GPU Performance Considerations
GPU and CPU Monitoring Differences
Nvidia GPU Metrics
Nvidia GPU Profiling
CUDA Environment Setup
GPU Sample Program
GPU Benchmark
Tensor Cores - Nvidia New Turing Architecture (Volta/T4 GPUs)

General Purpose GPU Computing

General Purpose GPU computing paradigm was started when Nvidia introduced CUDA (Compute Unified Device Architecture) compliant GPUs, that enables them to be programmed via high level programing languages like C and C++. In GPU-accelerated computing, program utilizes GPUs in conjunction with CPUs to accelerate compute heavy tasks, common in the fields of: AI, Machine Learning, Deep learning, Analytics, and many other engineering applications. Although application code runs on a cpu, it can offload compute intensive portions of the program, called kernel, to GPU. Net result is an overall performance boost. No knowledge of graphics programming (OpenGL or DirectX) is required when coding in CUDA language. One may have to learn modestly extended version of C.

Graphics APIs use high-level shading languages: DirectX, OpenGL and Cg. CUDA, on the other hand, can help writing general purpose extensions in C, that exploit massively parallel processing of GPU. This type of GPU programming is a clear separation from the early GPU programming model. CUDA offers flexible way of programming GPU that allows new algorithms to be developed and deployed quickly and efficiently

GPU with high memory bandwidth, thread parallelism (parallelism can help hide memory latencies), and abundance of fast register and L1 cache memory can able to improve matrix multiplication algorithm by storing larger datasets closer to the processing engine and that makes GPU ideal platform for Machine and Deep Learning. If a problem can be solved as a data-parallel computations, then each data elements can run in parallel. No sophisticated data flow and large caches are required, as found in CPU, considering the same function can be applied to each data element in parallel and the access latency can be hidden due to same calculation across all elements with high arithmetic intensity.

Many applications that process large datasets can use data-parallel programming model to speed up computations. Parallel data processing maps data elements to parallel processing threads that offers higher arithmetic intensity, where arithmetic intensity is simply a ratio of arithmetic operations to memory operations. For example:

In 3D rendering, large sets of pixels and vertices are mapped to parallel threads.
Image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, vision and pattern recognition can map image blocks and pixels to parallel processing threads.
Neural networks (DNN, CNN, RNN) are built using large numbers of identical interconnected neurons distributed across multi layer networks, they are highly parallel by nature. This sort of parallelism maps naturally to GPUs, which offers remarkable speedup over CPU-only training. Neural networks rely heavily on matrix math operations and require floating-point performance and memory bandwidth. GPUs have thousands of processing cores optimized for matrix math operations, providing tens to hundreds of TFLOPS of performance.
Internal model training benchmarks showed training (FFD, LSTM, CNN) time can be reduced several folds with GPU when compared to CPU.

CPU and GPU Comparison

CPU	GPU
A typical Intel/AMD CPU consists of few cores and optimized for sequential serial processing.	GPU has thousands of smaller cores designed specifically for handling compute intensive parallel tasks simultaneously.
CPUs are clocked at higher speed (> 2 GHz)	GPUs are normally clocked at a lower speed
Physical cpu socket is directly attached to system bus	Physical gpu board is connected to a system bus via PCI-e (PCI Express) bus. For multi-GPU configuration, GPU-GPU communication is possible via NVLINK, that has 300 Gbps throughput as compared to 30 Gbps on PCIe link.
Physical cpu socket can have multiple logical cores (4-30). Each core can have 2 Hyper-threads (HT). Thus each core can execute maximum of two threads simultaneously	Physical gpu is logically divided into 10-20 Streaming Multiprocessors (SM). Each SM can have hundreds of cores, and that adds up to cores count into thousands. Each core can do one thread of execution
Each CPU core has a dedicated on-core (L1, L2) caches and a larger off-core L3 cache, that is shared by all cores in CPU	Each SM (not core) in GPU has on-chip 512 KB register file and128 KB shared memory. There is a Off-chip 1.5 MB of L2 memory shared by all SM
CPU cores run independently of each other	GPU cores run 32 threads (called warp) in a lock step mode. All threads in the warp start together at the same program address. Each thread has its own instruction address counter and register state and is free to branch and execute independently.

Nvidia CUDA Architecture

CUDA is a combination of hardware and software architecture that enables Nvidia GPUs to execute programs written in C, C++, Fortran, OpenCL, DirectCompute and other languages. Nvidia GPUs are built around CUDA architecture.
CUDA uses parallel programming model that breaks the compute intensive task into hundreds or even thousands of parallel task that can run concurrently across thousands of cores in GPU. Nvidia GPUs are optimized for throughput not latency. CUDA treats GPU as a coprocessor. CUDA program requires copying data from CPU memory to GPU memory and vice versa. CUDA program strives to minimize data movement between CPU and GPU and in turn increase GPU utilization. CUDA programs with the help of Nvidia drivers transfer data to/from GPU via DMA (Direct Memory Access) transfer. It is alway CPU that initiates data transfer in both direction. GPU writes results into DMA buffer (part of system RAM) for cpu to pick it up. GPU program performs following high level operations:

CPU allocates memory on GPU (aka. device) by calling cudaMalloc()
CPU copies input data from CPU memory to device memory by calling cudaMemcpy()
CPU launches a function, called kernel, on device to process the input data
CPU copies results back from device memory to CPU memory by calling cudaMemcpy()

Kernel	GPU programmers write a function, called kernel, that runs in parallel on the GPU. A kernel executes in parallel across a set of parallel threads. Kernel function is applied to each element or record in the stream. Kernel functions are usually pipelined and local on-chip memory reuse is attempted for optimal performance in order to minimize the loss in bandwidth due to external memory interaction
Thread	A kernel executes in parallel across a set of parallel threads. In CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables.
Thread Block	Programmers or compilers organize these threads in thread blocks. A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. Each thread block has a per-Block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Each thread within a thread block executes an instance of the kernel and has a thread ID within its thread block, program counter, registers, per thread private memory, inputs and output results. Shared memory enables threads within the same thread block to cooperate, facilitates extensive reuse of on-chip data and greatly reduces off-chip traffic. Shared memory is a key enabler for many high performance CUDA apps
Grid	The GPU instantiates a kernel program on a grid of parallel thread blocks. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. Grids of thread blocks share results in Global Memory (device memory) space after kernel-wide global synchronization. A thread block has a block ID within its grid.

CUDA Unified Memory and Address Space (UVA)

CPU and GPU memory are physically separated by PCI-Express bus. Data sharing between CPU and GPU require program to explicitly allocate and copy data between two physical memory locations. CUDA unified memory model allows sharing of CPU and GPU memory using a single pointer. Unified memory looks like CPU memory to code running on the CPU, and GPU memory to code running on the GPU. System transparently migrates data allocated in Unified Memory between host (CPU) and device (GPU) when accessed. To allocate space in unified memory pool, program calls cudaMallocManaged() routine.
Unified Address Space (UVA) enables "Zero-Copy" memory, which is pinned (locked) host memory accessible by code running on device directly over PCI-Express bus, without requiring memcpy(). This feature allows an application to use single address space no matter where it is running: host or device. Memory allocated via CUDA API residing on host or device is mapped to the same virtual address space.Location of memory on host or device with UVA can be determined by calling cudaPointerGetAttributes() and passing a pointer to the address. Application may also query if UVA is used for a particular device by checking unifiedAddressing device property is set to 1. UVA allows cudaMemcpy() to be used without specifying where exactly input and output parameters resides.

NOTE: Unified Memory support eliminated "deep copy", an expensive operation. For example: to use the struct of data elements on the device, one has to copy all the data members in struct, in addition copy all the data that the struct points to, and then update all the pointers in that copy of the struct, resulting in complexity of passing data elements to a kernel function. Unified Memory simplifies it by eliminating this extra step as it operates on the same pointer as the host code.

Streaming Multiprocessors (SM)

Nvidia GPU architecture is built around a scalable array of multi-threaded Streaming Multiprocessors (SM). Program partitioned into blocks of threads that run independent of each other. A gpu with more SM finish in less time. A GPU executes one or more kernel grids and supports Single-Instruction-Multiple-Thread (SIMT) execution model where multiple independent threads execute a single instruction concurrently. Instructions are pipelined to leverage instruction-level parallelism within a single thread, as well as thread-level parallelism through simultaneous hardware multithreading.
A multi-processor (SMX) is designed to execute hundreds of threads concurrently. When a CUDA program on the host invokes a kernel grid, the blocks of the grid are enumerated and distributed to multi-processors (SMX) in GPU with available execution capacity. Multiple thread blocks can execute concurrently on one multiprocessor. As Thread Blocks terminate, new blocks are launched on the vacated multiprocessors

GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors.

WARPS
SMX creates, manages, schedules and executes threads in group of 32 parallel threads called warps. When a multiprocessor is given one or more Thread Blocks to execute, it partitions them into warps. How warps are used by SMX are described below:

Each SMX has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache or shared memory is partitioned among the Thread Blocks.
Individual threads composing a warp start together at the same program address, Each thread in a warp has its own instruction address counter and register state and are therefore free to branch and execute independently.
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the thread converge back to the same execution path. Branch divergence occurs only within a warp. Different warps execute independently regardless of whether they are executing common or disjoint code paths.
Threads of a warp that are on the warp's current execution path are called the active threads, whereas thread not on the current path are inactive(disabled). Threads can be inactive because:

They have exited earlier than other threads of their warp.
They are on a different branch path than the branch path currently executed by the warp.
Last threads of a block whose number of threads is not a multiple of the warp size

If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized write that occur to that location varies depending on the compute capability of the device and which thread performs the final write is undefined.
The execution context (program counter, registers, etc.) for each warp processed by a SMX is maintained on-chip during the entire lifetime of the warp. Therefore switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction and issues the instruction to those threads.
Thread block is partitioned into warps. Each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.
Number of blocks and warps that can be processed together on the SMX for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the SMX.
Compute capability of GPU depends on number of warps per SMX, registers and shared memory available on the multiprocessor. It differs on GPU type. Kernel fail to launch if there are not enough registers or shared memory available per SMX to process at least one Thread Block.

GPU Memory Hierarchy

Thread can access data from various memory spaces on GPU

Register File	Each SMX has 512 KB of memory across set of 32-bit registers partitioned among warps.
L1, L2 Cache	Local memory access are cached in L1 and L2. Compiler places large structures or arrays that would consume too much register space in Local cache. Also, if kernel uses more variables than registers available are also placed in L1 cache, called register spilling. There is a L1 cache per SMX. L2 cache is shared by all SMX. L1 cache is used to cache access to local memory, including temporary register spills. L2 is used for cache accesses to local and global memory.
Data Cache (Read-Only)	Each SMX has a 48 KB read-only data cache to speed up reads from device memory. SMX accesses this cache either directly or via a texture unit that implements the various addressing modes and data filtering. Each SMX multiprocessor has a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory.
Shared Memory	Shared memory is on-chip memory and thus has higher bandwidth and lower latency than L1,L2 and global memory. Shared memory latency is 100x times less than global memory. Shared memory is allocated per thread block, so all threads within block has access to shared memory. Shared memory is divided into equal-sized memory modules, called banks, which can be accessed simultaneously to achieve higher overall throughput. If two addresses of a memory request fall in the same memory bank, it results in bank conflict and result in serialized access.
Constant Memory	Constant memory resides in device memory and is cached in constant cache.
Global Memory (DRAM)	Global memory resides on GPU and accessed via 32,64,128-byte memory transactions. When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more memory transactions depending on the size of the word accessed by each thread and distribution of the memory addresses across the threads.

GPU Performance Considerations

Consider following guidelines to improve GPU utilization and in turn reduce model training time.

Make conscious design decision to reduce host and device memory transfers that can improve GPU utilization. Some Machine Learning algorithms or models can be executed completely on GPU and does not require CPU computation. For example: all stages of the tree construction (Decision Tree algorithm ) can be efficiently completed on GPU.

All stages of training running on GPU

Consider using Gradient Boosting for training your model to achieve higher accuracy for regression and classification tasks. Popular library for applying Gradient Boosting is XGBoost. XGBoost stands for Extreme Gradient Boosting algorithm, allows faster and scalable training in a multi-gpu setup. XGBoost library makes it possible to run all phases of training on GPU: gradient calculation, feature quantization, prediction, decision tree construction and evaluation. Some XGBoost library features:

Training the model requires whole dataset to fit in the combined memory of all GPU devices. Data compression techniques used by XGBoost reduces the GPU memory usage and allows much bigger data sets to be trained in parallel.
XGBoost uses symbol compression to store the quantized input matrix on the device. Matrix values are compressed down to log2(max_value) bits, where max_value is the maximum integer value of any quantized matrix element. Data is not modified once on the device and is read many times. The small number of bitwise operations computed on the GPU incur no visible performance penalty. This bit compression method typically reduces GPU memory consumption by 4x or more over the standard floating point representation, thus allows training on significantly larger datasets.
XGBoost on gpu works on a quantile representation of the input feature space. Having features in quantized form reduces the tree construction problem to one gradient summation into histograms, speeding up execution time. The feature x is binned, so that in each bin there is roughly the same number of data-points. Reduce the computational complexity of finding the best split from O(nfeatures x ninstances) to O(nfeatures x nbins) with nbins << ninstances
Internal benchmarks reported model training time reduction of up to 20x when using xgboost library

Scaling computation from one GPU to multiple in a single machine can enable much faster model training. Horovod Library makes it fast and easy (minimum code changes) to do distributed deep learning in TensorFlow. New Nvidia GPUs supports NVLink for inter-gpu communication, that offers 10x times higher throughput than PCIe bus.
Train model with mixed precision. Nvidia new Tensor cores (Volta/Turing GPU) offer hardware acceleration for mixed precision training. Lower precision than 32-bit floating point requires less memory and bandwidth. Math operations run faster in reduced precision. Up to 3% speed up is possible with mixed precision training with Volta/Turning architecture.
Simple and small algorithms can be used as a building blocks for massively parallel algorithms. GPU primitives may be used to compose more complicated algorithms while retaining high performance, readability and reliability. Some examples of parallel primitives:

Reduction Harris
Parallel prefix sum (scan)
Radix sort
Segmented scan and reduce
Interleaved sequences: multi-reduce
Interleaved sequences: multi-scan

GPUs are optimized for 32-bit floating point operations, but not for 64-bit double precision. Consider 32-bit parallel and sequential summation. 32-bit parallel summation shows dramatically superior numerical stability considering error of parallel summation grows proportionally to O(logn), as compared to O(n) for sequential summation

Nvidia GPU Metrics

Metrics	CPU	GPU
Processor Utilization	Idle if no thread is running or ready to run on cpu. Tool: cat /proc/stat	idle if no kernel (cuda functions) running on gpu. GPU allows multiple kernels to execute concurrently. Timeline View in visual profiler shows status of all kernel and stream on GPU. Tool: nvprof and NVML library.
Processor Cores Utilization	workload concurrency. If application threads/processes are using all vcpu (cores). Tool: pidstat, top, ps	CUDA applications manage concurrency by running async commands in streams. Where stream is a sequence of commands that execute in order. In multi-threaded program, each threads can call kernels that will be executed concurrently. Kernel parameters like: grid and block sizes are also reported to know level of concurrency used in the kernel. Tool: nvprof, nvvp
Thread states	Threads states can be: running, sleeping (stalled), blocking on IO. Tool: ps, pidstat	Threads run in group of 32, called warp, on SM (multiprocessor). Threads can be in two states: active, inactive or stalled. There are metrics that nvprof can query to find if one or more threads are active or stalled. Tool: nvprof, nvvp.
IPC/CPI	instructions per cycle. Tool: Linux perf stat	Instruction per cycle across gpu; instructions issued and executed per cycle per SM and per warp. SM can schedule 4 concurrent warps. Tool: nvprof.
Cache usage	cpu L1, L2 cache usage: hit, miss and usage. Tool: Linux perf stat	gpu registers and Shared memory usage. L1 and L2 cache hit/miss and usage. Tool: nvprof, nvvp
Memory	System level memory (RAM) usage: Tool: free, vmstat	Global memory usage: total, free, used. Tool: nvprof, NVML library
Network/Storage	Network and storage throughput: Tool: sar, iostat	No network or storage attached to GPU. Throughput and transaction count are reported for PCI-E bus: read/write to system memory (RAM) via DMA, device global memory, L1-L2 caches and on-chip shared memory. Tool: nvprof, nvvp
Active process, threads	Reports process/threads running on cpus: Tool: top	Reports kernels (cuda functions) running on GPU. Tools: nvprof, nvvp.
Active functions	Application or Linux kernel function running on cpus: Tool: Linux perf top	For each kernel, one can list: CUDA functions executed on device, time spend in memcpy (device <-> host) and running CUDA functions on gpu. Tool: nvprof, nvvp

Nvidia GPU Metrics

There are some basic metrics that can be fetched via Netflix Vector or Grafana, front ends to PCP open source package to find GPU utilization.

Metrics	Purpose
nvidia.gpuactive	Percent of time in last sample period, one or more kernels were executing on the GPU
nvidia.memactive	Percent of time in last sample period device memory was being read and written
nvidia.memused	memory allocated on gpu
nvidia.memtotal	total global memory on gpu
nvidia.memfree	free global memory on gpu
nvidia.perfstat	power states (P0 means full power) on gpu
nvidia.numcards	number of physical gpu attached to system
nvidia.gpuid	gpu number 0-7
nvidia.busid	PCI-e bus id where gpu is connected
nvidia.cardname	card name
nvidia.temp	gpu temperature
nvidia.fanspeed	gpu fan speed

Nvidia GPU Profiling

One can profile functions (kernel) running on GPU and isolate the cause of GPU bottlenecks and low utilization via Nvidia tools: nvprof (cli), Visual Profiler (GUI) for annotating events, code ranges and application resources. Also Nvidia C API, called NVTX , can be integrated into the program to: capture, visualize (via Visual Profiler) and trace cpu events, time ranges and naming of CUDA resources

Nvidia profiling tool nvprof (similar to Linux perf for host cpu) can capture profiling samples, events and various GPU metrics as listed below.

Metrics	Purpose
ipc	instructions executed per cycle
ipc_instance	instructions executed per cycle for a single SM
inst_per_wrap	Average number of instructions executed by each warp
issue_ipc	instructions issued per cycle
achieved_occupancy	Ratio of average active wraps per active cycle to the maximum number of warps supported on SM
alu_fu_utilization	Utilization level on scale of 0 to 10 of SM function units that execute integer and floating point arithmetic instructions
warp_execution_efficiency	Ratio of average active threads per warp to the maximum number of threads per warp supported on SM. Reported as %age
branch_efficiency	Ratio of non-divergent branches to total branches expressed as percentage
dram_read_throughput dram_read_transactions	GPU device memory read tput and read transactions
dram_write_throughput dram_write_transactions	GPU device memory write tput and write transactions
dram_utilization	Device memory utilization on scale of 0 to 10 relative to the peak utilization
flop_count_dp_* flop_count_sp_*	number of metrics for single and double precision floating point operation stats: add, multiply-accumulate, multiply etc..
l1_cache_global_hit_rate	Hit rate in L1 cache for global loads
l1_cache_local_hit_rate	Hit rate in L1 cache for local loads and storage
l1_shared_utilization	Utilization level on scale 0 to 10 for L1/shared memory utilization relative to peak utilization
l2_read_throughput l2_read_transactions	Memory read throughput and transactions seen at L2 cache for all read requests
l2_write_throughput l2_write_transactions	Memory write throughput and transactions seen at L2 cache for all write requests
ldst_executed ldst_issued	Number of issued and executed load and store instructions
gld_throughput gld_transactions	global memory load throughput and transactions
local_store_throughput local_store_transactions	Local memory store throughput and transactions
shared_load_transactions shared_store_transactions	shared memory load and store transactions
sysmem_utilization	Utilization of system memory on scale of 0 to 10 relative to peak utilization
sysmem_read_throughput sysmem_read_transactions	system memory read throughput and transactions
sysmem_write_throughput sysmem_write_transactions	system memory write throughput and transactions
stalls_sync	percentage of stalls occurring because the warp is blocked at a __syncthread() call

Event: Countable activity, action, or occurrence on a device and corresponds to a single hardware counter value, collected when kernel execute. $nvprof --query-events.

Metric: Calculated from one or more event values to represent some GPU characteristics. $nvprof --query-metrics.

Capture a particular metric: $ nvprof --metrics achieved_oocupancy, executed_ipc -o nvprof.out <application_name>

Performance analysis of specific kernel running on the GPU, type:
$ nvprof --kernel <kernel-name> --analysis-metrics -o analysis.prof <application_name>
It will produce analysis.perf file that can be viewed using Nvidia Visual Profiler

Caution: nvprof metric option may negatively affect performance characteristics of function running on GPU as it may cause all kernel executions to be serialized on GPU.

CUDA Environment Setup

Machine Learning pipeline is composed of many stages: Data ingestion, exploration, feature generation, data cleansing, model training, validation, and lastly publishing and collecting performance metrics. Having a CI/CD pipeline can automate the process of model retraining and deploying in production as a microservices.

Package	Purpose	Comment
gcc	to build C program to run on CPU	$ gcc --version gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
linux-headers	Linux headers	$ sudo apt-get install linux-headers-$(uname -r)
cuda	cuda build environment and tools	$sudo apt-get install cuda Installs nvdia driver and cuda binaries and libraries. This will also build the kernel and initrd files in /boot and update the grub. $ sudo apt-get install nvidia-cuda-toolkit Latest nvidia drivers: $ wget http://us.download.nvidia.com/tesla/418.40.04/NVIDIA-Linux-x86_64-418.40.04.run $ sudo ./NVIDIA-Linux-x86_64-418.40.04.run
PATH LD_LIBRARY_PATH	Sets env to find cuda binaries and libraries	Update PATH and LD_LIBRARY_PATH environment variables in /etc/environment file PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/cuda-8.0/bin" LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib
cuda-install-samples-8.0.sh /mnt	Installs cuda sample programs	Run the command below to install CUDA sample programs in /mnt directory. $ /usr/local/cuda-8.0/bin/cuda-install-samples-8.0.sh /mnt After successful install, you should see the following folders created in directory: /mnt/NVIDIA_CUDA-8.0_Samples $ ls 0_Simple 1_Utilities 2_Graphics 3_Imaging 4_Finance 5_Simulations 6_Advanced 7_CUDALibraries common EULA.txt Makefile
lsmod	check if nvidia device driver is loaded and device nodes are created	To confirm nvidia drivers are loaded run: $ nvidia-sm $ lsmod \| grep nvidia $ cat /proc/driver/nvidia/version List all GPU device attached to PCIe bus $lspci List GPU to nvidia driver binding $lspci -v Files used by CUDA to communicate with the kernel-mode portion of the NVIDIA Driver. $ ls -lt /dev/nv* crw-rw-rw- 1 root root 245, 0 Jul 18 23:05 /dev/nvidia-uvm crw-rw-rw- 1 root root 195, 0 Jul 18 23:05 /dev/nvidia0 crw-rw-rw- 1 root root 195, 255 Jul 18 23:05 /dev/nvidiactl NOTE: Set Nvidia GPU to run at higher clock all the time by disabling auto-boost feature, For Nvidia Tesla T4 run: $ sudo nvidia-smi -ac 5001,1590
devicequery	Verify install by querying GPU device	Binaries are placed in the /mnt/NVIDIA_CUDA-8.0_Samples/bin directory $ cd /mnt/NVIDIA_CUDA-8.0_Samples $ make $ cd /mnt/NVIDIA_CUDA-8.0_Samples/bin/mnt/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release $ ./deviceQuery

Nvidia Validation Suite: http://docs.nvidia.com/deploy/nvvs-user-guide/index.html#nvidia-validation-suite-goals
$ nvvs -g
$ nvvs -c Amazon_EC2_p3.16xlarge.conf -d 5 -1 debug.log$nvvs -c Tesla_K40c_quick.conf

Cuda repos for Ubuntu Xenial: http://developer.download.nvidia.com/compute/cuda/repos/

Nvidia Validation Suite consist of a series of plugins:
Deployment plugin: Verify compute environment is ready to run Cuda app and is able to load NVML library
GPU Bandwidth plugin: Measure bandwidth and latency to and from the GPUs and the host
SM Performance Plugin: Bring the GPU to a target performance level in gigaflops by doing large matrix multiplications

GPU Sample Program

Sample program perform cube of first 64 numbers

#include <stdio.h>

// kernel or function that will run on GPU

__global__ void cube(float * d_out, float * d_in){

int idx = threadIdx.x;

float f = d_in[idx];

d_out[idx] = f * f * f;

}

int main(int argc, char ** argv) {

const int ARRAY_SIZE = 96;

const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

// initialize the input array on the host

float h_in[ARRAY_SIZE];

for (int i = 0; i < ARRAY_SIZE; i++) {

h_in[i] = float(i);

}

float h_out[ARRAY_SIZE];

// declare GPU memory pointers

float * d_in;

float * d_out;

// allocate GPU memory

cudaMalloc((void**) &d_in, ARRAY_BYTES);

cudaMalloc((void**) &d_out, ARRAY_BYTES);

// transfer the array to the GPU

cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

// launch the kernel

cube<<<1, ARRAY_SIZE>>>(d_out, d_in);

// copy back result array to the CPU

cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

// print out the resulting array

for (int i =0; i < ARRAY_SIZE; i++) {

printf("%f", h_out[i]);

printf(((i % 4) != 3) ? "\t" : "\n");

}

cudaFree(d_in);

cudaFree(d_out);

return 0;

}

Compile and run:

$ nvcc -o cube cube.cu

$ ./cube

0.000000 1.000000 8.000000 27.000000

64.000000 125.000000 216.000000 343.000000

512.000000 729.000000 1000.000000 1331.000000

1728.000000 2197.000000 2744.000000 3375.000000

4096.000000 4913.000000 5832.000000 6859.000000

8000.000000 9261.000000 10648.000000 12167.000000

13824.000000 15625.000000 17576.000000 19683.000000

21952.000000 24389.000000 27000.000000 29791.000000

32768.000000 35937.000000 39304.000000 42875.000000

46656.000000 50653.000000 54872.000000 59319.000000

64000.000000 68921.000000 74088.000000 79507.000000

85184.000000 91125.000000 97336.000000 103823.000000

...

Nvidia GPU Feature Comparison and Benchmarks

As compare to CPU, GPU has much higher single/double FP compute power and memory bandwidth

Nvidia Training and Inference Benchmarks: MLPerf, ResNet-50 using: TensorFlow, MXNet, PyTorch

Tensor Cores - Nvidia New Architecture (Volta/Turing GPUs)

Nvidia's Volta and Turing architecture introduced Tensor Cores, that are specialized execution units designed specifically for accelerating the tensor (matrix) operations, important compute functions used in Deep Learning training and inference.

For graphic acceleration, Tensor cores uses a technique called Deep Learning Super Sampling (DLSS). DLSS leverages a DNN to extract multi-dimensional features of the rendered scene and intelligently combine details from multiple frames to construct a high quality final image. All this capabilities with fewer input samples and reduced algorithmic complexity, when dealing with transparency and other complex scene elements.

Turning SM provides independent floating point and integer data path that run in parallel. In previous generations, these instructions would have blocked floating-point instructions from issuing. This feature accelerate deep learning inference applications that use mix of computation and address calculation.

One of the big advantages of Deep Learning (DL) is that model can be trained at high precision and implemented at lower precision without sacrificing accuracy. T4 is priced lower than Volta GPUs. Even though, T4 is slower than Volta GPU in DL Training Benchmarks (ResNet-50 image classification on CNNs, MLPerf ), T4 is at par and even better than Volta in DL Powered Inference due to enhanced DL capabilities.

Tensor cores are purpose-built to accelerate multi-precision inference performance New Streaming Multiprocessor (SM) with Turing Tensor Cores provides FP16/FP32 mixed-precision matrix math. There is also new INT8/INT4 precision modes that support fast INT8 matrix operations to improve throughput with minimal loss in accuracy. Ability to do reduced-precision inference can significantly lowers application latency while preserving model accuracy, required for embedded application.

T4 Turing offers Unified shared memory and L1 cache, that simplify programming and boost performance. It also supports: address space isolation and independent thread scheduling that enable finer grain synchronization and cooperation

Linux Performance in Cloud

Friday, December 27, 2019

General Purpose GPU Computing

General Purpose GPU Computing

CPU and GPU Comparison

Nvidia CUDA Architecture

CUDA Unified Memory and Address Space (UVA)

Streaming Multiprocessors (SM)

GPU Memory Hierarchy

GPU Performance Considerations

Nvidia GPU Metrics

Nvidia GPU Metrics

Nvidia GPU Profiling

CUDA Environment Setup

Nvidia GPU Feature Comparison and Benchmarks

Tensor Cores - Nvidia New Architecture (Volta/Turing GPUs)

Blog Archive