Friday, December 27, 2019

General Purpose GPU Computing

General Purpose GPU Computing
GPU and CPU Comparison
Nvidia CUDA Architecture
CUDA Unified Memory and Address Space (UVA)
GPU Memory Hierarchy
Streaming Multiprocessors (SM)
GPU Performance Considerations
GPU and CPU Monitoring Differences
Nvidia GPU Metrics
Nvidia GPU Profiling
CUDA Environment Setup
GPU Sample Program
GPU Benchmark
Tensor Cores - Nvidia New Turing Architecture (Volta/T4 GPUs)

General Purpose GPU Computing

The general purpose GPU computing paradigm started when Nvidia introduced CUDA (Compute Unified Device Architecture) compliant GPUs, which can be programmed in high-level programming languages like C and C++. In GPU-accelerated computing, a program uses GPUs in conjunction with CPUs to accelerate compute-heavy tasks common in AI, machine learning, deep learning, analytics and many other engineering applications. Although the application code runs on a CPU, it can offload compute-intensive portions of the program, called kernels, to the GPU. The net result is an overall performance boost. No knowledge of graphics programming (OpenGL or DirectX) is required when coding in CUDA; one only has to learn a modestly extended version of C.

Graphics programming uses APIs such as DirectX and OpenGL together with high-level shading languages such as Cg. CUDA, on the other hand, lets programmers write general purpose code in C with a few extensions that exploit the massively parallel processing of the GPU. This is a clear departure from the early GPU programming model. CUDA offers a flexible way of programming the GPU that allows new algorithms to be developed and deployed quickly and efficiently.

A GPU combines high memory bandwidth, thread parallelism (which helps hide memory latency) and an abundance of fast register and L1 cache memory. This speeds up algorithms such as matrix multiplication by keeping larger datasets close to the processing engines, and it makes the GPU an ideal platform for machine learning and deep learning. If a problem can be expressed as a data-parallel computation, then each data element can be processed in parallel. Sophisticated flow control and large caches, as found in a CPU, are not required, because the same function is applied to every data element in parallel and memory access latency is hidden by calculations with high arithmetic intensity.

Many applications that process large datasets can use a data-parallel programming model to speed up computations. Parallel data processing maps data elements to parallel processing threads, which yields higher arithmetic intensity, where arithmetic intensity is simply the ratio of arithmetic operations to memory operations. For example:
  • In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. 
  • Image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, vision and pattern recognition can map image blocks and pixels to parallel processing threads.
  • Neural networks (DNN, CNN, RNN) are built from large numbers of identical interconnected neurons distributed across multi-layer networks, so they are highly parallel by nature. This parallelism maps naturally to GPUs, which offer a remarkable speedup over CPU-only training. Neural networks rely heavily on matrix math operations and require high floating-point performance and memory bandwidth. GPUs have thousands of processing cores optimized for matrix math operations, providing tens to hundreds of TFLOPS of performance.
  • Internal model training benchmarks showed that training time (FFD, LSTM, CNN) can be reduced severalfold with a GPU compared to a CPU.
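
To make the arithmetic intensity ratio defined above concrete, the following host-side C++ sketch (it needs no GPU) compares element-wise vector addition with an N x N matrix multiplication. The operation counts are simplifying assumptions (each input element is loaded once and each output element is stored once), and the function names are hypothetical.

```cpp
#include <cassert>

// Arithmetic intensity = arithmetic operations / memory operations.
double arithmetic_intensity(double arithmetic_ops, double memory_ops) {
    assert(memory_ops > 0.0);
    return arithmetic_ops / memory_ops;
}

// c[i] = a[i] + b[i]: one add, two loads and one store per element.
double vector_add_intensity(long long n) {
    return arithmetic_intensity(1.0 * n, 3.0 * n);
}

// C = A * B for N x N matrices: N^3 multiplies plus N^3 adds.
// Assumes ideal on-chip reuse, so only 3 * N^2 values cross the
// memory interface (A and B loaded once, C stored once).
double matmul_intensity(long long n) {
    double nn = static_cast<double>(n);
    return arithmetic_intensity(2.0 * nn * nn * nn, 3.0 * nn * nn);
}
```

Under these assumptions vector addition stays at about 0.33 operations per memory access regardless of size, while the matrix product reaches 200 for N = 300. This is one way to see why matrix-heavy workloads benefit most from keeping data close to the processing engines.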

CPU and GPU Comparison

A typical Intel/AMD CPU consists of a few cores optimized for sequential serial processing.
A GPU has thousands of smaller cores designed specifically for handling compute-intensive parallel tasks simultaneously.
CPUs are clocked at a higher speed (> 2 GHz).
GPUs are normally clocked at a lower speed.
A physical CPU socket is directly attached to the system bus.
A physical GPU board is connected to the system bus via a PCIe (PCI Express) bus. In a multi-GPU configuration, GPU-to-GPU communication is possible via NVLink, which offers up to 300 GB/s of throughput compared to about 30 GB/s on a PCIe link.
A physical CPU socket can have multiple cores (4-30). Each core can have 2 hyper-threads (HT), so each core can execute a maximum of two threads simultaneously.
A physical GPU is logically divided into 10-20 Streaming Multiprocessors (SM). Each SM can have hundreds of cores, which brings the total core count into the thousands. Each core executes one thread.
Each CPU core has dedicated on-core L1 and L2 caches and a larger off-core L3 cache that is shared by all cores in the CPU.
Each SM (not core) in the GPU has an on-chip 512 KB register file and 128 KB of shared memory. A 1.5 MB L2 cache outside the SMs is shared by all SMs.
CPU cores run independently of each other.
GPU cores run groups of 32 threads (called warps) in lock-step mode. All threads in a warp start together at the same program address. Each thread has its own instruction address counter and register state and is free to branch and execute independently.

Nvidia CUDA Architecture

CUDA is a combination of hardware and software architecture that enables Nvidia GPUs to execute programs written in C, C++, Fortran, OpenCL, DirectCompute and other languages. Nvidia GPUs are built around CUDA architecture.
CUDA uses a parallel programming model that breaks a compute-intensive task into hundreds or even thousands of parallel tasks that can run concurrently across thousands of cores in the GPU. Nvidia GPUs are optimized for throughput, not latency. CUDA treats the GPU as a coprocessor. A CUDA program requires copying data from CPU memory to GPU memory and vice versa, so it strives to minimize data movement between CPU and GPU and in turn increase GPU utilization. CUDA programs, with the help of Nvidia drivers, transfer data to and from the GPU via DMA (Direct Memory Access). It is always the CPU that initiates data transfer in both directions. The GPU writes results into a DMA buffer (part of system RAM) for the CPU to pick up. A GPU program performs the following high-level operations:
  • CPU allocates memory on GPU (aka. device) by calling cudaMalloc()
  • CPU copies input data from CPU memory to device memory by calling cudaMemcpy()
  • CPU launches a function, called kernel, on device to process the input data
  • CPU copies results back from device memory to CPU memory by calling cudaMemcpy()

GPU programmers write a function, called a kernel, that runs in parallel on the GPU across a set of parallel threads. The kernel function is applied to each element or record in the stream. Kernel functions are usually pipelined, and local on-chip memory is reused where possible to minimize the bandwidth lost to external memory accesses.
In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls and C automatic array variables.
Thread Block
Programmers or compilers organize these threads in thread blocks. A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing and result sharing in parallel algorithms. Each thread within a thread block executes an instance of the kernel and has a thread ID within its thread block, a program counter, registers, per-thread private memory, inputs and output results. Shared memory enables threads within the same thread block to cooperate, facilitates extensive reuse of on-chip data and greatly reduces off-chip traffic. Shared memory is a key enabler for many high-performance CUDA applications.
The GPU instantiates a kernel program on a grid of parallel thread blocks. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. Grids of thread blocks share results in Global Memory (device memory) space after kernel-wide global synchronization. A thread block has a block ID within its grid. 
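
The thread, block and grid hierarchy above comes down to simple index arithmetic. The host-side C++ sketch below (no GPU required) mirrors how a 1-D launch is typically sized and how each thread finds its element. The helper names are hypothetical; the index formula corresponds to the CUDA expression blockIdx.x * blockDim.x + threadIdx.x.

```cpp
#include <cassert>

// Number of thread blocks needed to cover n elements (ceiling division).
int blocks_needed(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Global element index of a thread in a 1-D grid; mirrors
// blockIdx.x * blockDim.x + threadIdx.x in a CUDA kernel.
int global_index(int block_id, int threads_per_block, int thread_id) {
    assert(block_id >= 0);
    assert(thread_id >= 0 && thread_id < threads_per_block);
    return block_id * threads_per_block + thread_id;
}

// Threads in the last block whose global index falls past the end of
// the data; a kernel has to guard against these with a bounds check.
int idle_threads(int n, int threads_per_block) {
    return blocks_needed(n, threads_per_block) * threads_per_block - n;
}
```

For 1000 elements and 256 threads per block this gives 4 blocks, with the last 24 threads of the final block left without work.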

CUDA Unified Memory and Address Space (UVA)

CPU and GPU memory are physically separated by the PCI Express bus. Sharing data between CPU and GPU requires the program to explicitly allocate memory in, and copy data between, the two physical memory locations. The CUDA Unified Memory model allows CPU and GPU to share memory through a single pointer. Unified Memory looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU. The system transparently migrates data allocated in Unified Memory between host (CPU) and device (GPU) when it is accessed. To allocate space in the unified memory pool, the program calls the cudaMallocManaged() routine.
Unified Virtual Addressing (UVA) enables "Zero-Copy" memory, which is pinned (locked) host memory that code running on the device can access directly over the PCI Express bus, without requiring an explicit memcpy. This feature allows an application to use a single address space no matter where it is running: host or device. Memory allocated via the CUDA API, whether it resides on the host or the device, is mapped into the same virtual address space. With UVA, the location of memory (host or device) can be determined by calling cudaPointerGetAttributes() and passing a pointer to the address. An application may also query whether UVA is used for a particular device by checking that the unifiedAddressing device property is set to 1. UVA allows cudaMemcpy() to be used without specifying where exactly the input and output parameters reside (by passing cudaMemcpyDefault as the copy kind).

NOTE: Unified Memory eliminates the need for a "deep copy", an expensive operation. For example, to use a struct of data elements on the device without Unified Memory, one has to copy all the data members of the struct, copy all the data that the struct points to, and then update all the pointers in the copy of the struct. This makes passing data elements to a kernel function complex. Unified Memory removes this extra step because device code operates on the same pointers as the host code.

Streaming Multiprocessors (SM)

The Nvidia GPU architecture is built around a scalable array of multi-threaded Streaming Multiprocessors (SM). A program is partitioned into blocks of threads that run independently of each other, so a GPU with more multiprocessors automatically executes the program in less time than a GPU with fewer multiprocessors. A GPU executes one or more kernel grids and supports the Single-Instruction-Multiple-Thread (SIMT) execution model, in which multiple independent threads execute a single instruction concurrently. Instructions are pipelined to leverage instruction-level parallelism within a single thread, as well as thread-level parallelism through simultaneous hardware multithreading.
A multiprocessor (SMX) is designed to execute hundreds of threads concurrently. When a CUDA program on the host invokes a kernel grid, the blocks of the grid are enumerated and distributed to the multiprocessors in the GPU with available execution capacity. Multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

An SMX creates, manages, schedules and executes threads in groups of 32 parallel threads called warps. When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps. How warps are used by the SMX is described below:
  • Each SMX has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache or shared memory that is partitioned among the thread blocks.
  • Individual threads composing a warp start together at the same program address. Each thread in a warp has its own instruction address counter and register state and is therefore free to branch and execute independently.
  • A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp. Different warps execute independently regardless of whether they are executing common or disjoint code paths.
  • Threads of a warp that are on the warp's current execution path are called active threads, whereas threads not on the current path are inactive (disabled). Threads can be inactive because:
    • They have exited earlier than other threads of their warp.
    • They are on a different branch path than the one currently executed by the warp.
    • They are the last threads of a block whose number of threads is not a multiple of the warp size.
  • If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device, and which thread performs the final write is undefined.
  • The execution context (program counter, registers, etc.) for each warp processed by an SMX is maintained on-chip during the entire lifetime of the warp. Therefore switching from one execution context to another has no cost. At every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction and issues the instruction to those threads.
  • A thread block is partitioned into warps. Each warp contains threads of consecutive, increasing thread IDs, with the first warp containing thread 0.
  • The number of blocks and warps that can reside and be processed together on the SMX for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the SMX.
  • The maximum number of warps per SMX and the registers and shared memory available on the multiprocessor depend on the compute capability of the GPU, which differs by GPU type. A kernel fails to launch if there are not enough registers or shared memory available per SMX to process at least one thread block.
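
The resource rules in the list above can be turned into a rough estimate. The host-side C++ sketch below (no GPU required) computes how many warps a block is split into, how many lanes of the last warp stay inactive, and how many blocks fit on one multiprocessor at the same time. It is a simplified model: the limits passed in are hypothetical inputs rather than values for any particular GPU, and real hardware also rounds register and shared memory allocations to a granularity that is ignored here.

```cpp
#include <algorithm>
#include <cassert>

const int kWarpSize = 32;

// Hypothetical per-multiprocessor limits; real values depend on the GPU.
struct SmLimits {
    int registers;     // 32-bit registers available
    int shared_bytes;  // shared memory available
    int max_warps;     // maximum resident warps
    int max_blocks;    // maximum resident thread blocks
};

// Resources requested by one kernel launch configuration.
struct KernelUsage {
    int threads_per_block;
    int registers_per_thread;
    int shared_bytes_per_block;
};

// A block is split into warps of 32 consecutive threads.
int warps_per_block(int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Lanes of the last warp that have no thread and therefore stay inactive.
int inactive_lanes(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}

// Blocks that can be resident together on one multiprocessor.
// A result of 0 means the kernel cannot launch with this configuration.
int resident_blocks(const SmLimits& sm, const KernelUsage& k) {
    int warps = warps_per_block(k.threads_per_block);
    int regs_per_block = warps * kWarpSize * k.registers_per_thread;
    int by_regs = regs_per_block > 0 ? sm.registers / regs_per_block
                                     : sm.max_blocks;
    int by_shared = k.shared_bytes_per_block > 0
                        ? sm.shared_bytes / k.shared_bytes_per_block
                        : sm.max_blocks;
    int by_warps = sm.max_warps / warps;
    return std::min({by_regs, by_shared, by_warps, sm.max_blocks});
}
```

With assumed limits of 65536 registers, 48 KB of shared memory, 64 warps and 16 blocks, a kernel using 256 threads per block, 32 registers per thread and 8 KB of shared memory per block is limited to 6 resident blocks by shared memory, even though registers and warp slots would allow 8.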

GPU Memory Hierarchy

Thread can access data from various memory spaces on GPU

Register File
Each SMX has a 512 KB register file, organized as a set of 32-bit registers partitioned among warps.
L1, L2 Cache
Local memory accesses are cached in L1 and L2. The compiler places large structures or arrays that would consume too much register space in local memory. If a kernel uses more variables than there are registers available, the excess variables are also placed in local memory, which is called register spilling. There is one L1 cache per SMX, and the L2 cache is shared by all SMXs. L1 caches accesses to local memory, including temporary register spills. L2 caches accesses to both local and global memory.
Data Cache (Read-Only)
Each SMX has a 48 KB read-only data cache to speed up reads from device memory. SMX accesses this cache either directly or via a texture unit that implements the various addressing modes and data filtering.

Each SMX also has a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory.
Shared Memory
Shared memory is on-chip memory and thus has much higher bandwidth and lower latency than L2 and global memory. Shared memory latency is roughly 100x lower than global memory latency. Shared memory is allocated per thread block, so all threads within a block have access to the same shared memory. Shared memory is divided into equal-sized memory modules, called banks, which can be accessed simultaneously to achieve higher overall throughput. If two different addresses of a memory request fall in the same memory bank, a bank conflict occurs and the accesses are serialized.
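
The effect of the access stride on bank conflicts can be sketched on the host in plain C++ (no GPU required). The model below assumes 32 banks that are each 4 bytes wide, with consecutive 4-byte words mapped to consecutive banks; this layout and the function name are assumptions made for illustration, so check the documentation of the target GPU for the real bank organization.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

const int kWarpSize = 32;
const int kBanks = 32;  // assumption: 32 banks, each 4 bytes wide

// Worst-case number of serialized accesses when thread t of a warp
// reads the 4-byte word at index t * stride_words in shared memory.
// Threads reading the same word do not conflict (the word is broadcast),
// so only distinct words per bank are counted.
int conflict_degree(int stride_words) {
    assert(stride_words >= 0);
    std::vector<std::set<long long>> words_per_bank(kBanks);
    for (int t = 0; t < kWarpSize; ++t) {
        long long word = static_cast<long long>(t) * stride_words;
        words_per_bank[word % kBanks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& words : words_per_bank) {
        worst = std::max(worst, words.size());
    }
    return static_cast<int>(worst);
}
```

In this model a stride of 1 is conflict-free, a stride of 2 causes 2-way conflicts and a stride of 32 serializes all 32 accesses. A stride of 33 is conflict-free again, which is why padding a 32-column shared array by one column is a common remedy.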
Constant Memory
Constant memory resides in device memory and is cached in constant cache.
Global Memory (DRAM)
Global memory resides in device DRAM and is accessed via 32-, 64- or 128-byte memory transactions. When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more memory transactions, depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads.
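
How the address distribution changes the number of transactions can be estimated with a small host-side C++ model (no GPU required). The function below counts the distinct memory segments touched by one warp. It assumes a segment-aligned base address and a fixed segment size chosen by the caller; the function name is hypothetical, and actual hardware picks transaction sizes according to its own rules.

```cpp
#include <cassert>
#include <set>

// Counts the distinct memory segments touched when each of the 32
// threads of a warp accesses one element of elem_bytes bytes located
// at element index t * stride_elems. Fewer segments means the
// accesses coalesce into fewer memory transactions.
int segments_touched(int stride_elems, int elem_bytes, int segment_bytes) {
    assert(stride_elems >= 0 && elem_bytes > 0);
    assert(segment_bytes >= elem_bytes);
    std::set<long long> segments;
    for (int t = 0; t < 32; ++t) {
        long long first_byte =
            static_cast<long long>(t) * stride_elems * elem_bytes;
        long long last_byte = first_byte + elem_bytes - 1;
        segments.insert(first_byte / segment_bytes);
        segments.insert(last_byte / segment_bytes);
    }
    return static_cast<int>(segments.size());
}
```

With 4-byte elements and 128-byte segments, unit stride touches a single segment, a stride of 2 touches 2 segments, and a stride of 32 touches 32 segments, one per thread.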

GPU Performance Considerations

Consider the following guidelines to improve GPU utilization and in turn reduce model training time.
  • Make a conscious design decision to reduce host and device memory transfers, which improves GPU utilization. Some machine learning algorithms or models can be executed completely on the GPU and do not require CPU computation. For example, all stages of tree construction (Decision Tree algorithm) can be efficiently completed on the GPU.
(Figure: All stages of training running on the GPU)

  • Consider using gradient boosting to train your model to achieve higher accuracy on regression and classification tasks. A popular library for gradient boosting is XGBoost (Extreme Gradient Boosting), which allows faster and scalable training in a multi-GPU setup. The XGBoost library makes it possible to run all phases of training on the GPU: gradient calculation, feature quantization, prediction, decision tree construction and evaluation. Some XGBoost library features:
    • Training the model requires the whole dataset to fit in the combined memory of all GPU devices. The data compression techniques used by XGBoost reduce GPU memory usage and allow much bigger datasets to be trained in parallel.
    • XGBoost uses symbol compression to store the quantized input matrix on the device. Matrix values are compressed down to log2(max_value) bits, where max_value is the maximum integer value of any quantized matrix element. Data is not modified once on the device and is read many times. The small number of bitwise operations computed on the GPU incurs no visible performance penalty. This bit compression method typically reduces GPU memory consumption by 4x or more over the standard floating point representation, and thus allows training on significantly larger datasets.
    • XGBoost on the GPU works on a quantile representation of the input feature space. Having features in quantized form reduces the tree construction problem to one of gradient summation into histograms, which speeds up execution. Each feature x is binned so that each bin holds roughly the same number of data points. This reduces the computational complexity of finding the best split from O(nfeatures x ninstances) to O(nfeatures x nbins), with nbins << ninstances.
    • Internal benchmarks reported a model training time reduction of up to 20x when using the XGBoost library.
  • Scaling computation from one GPU to multiple GPUs in a single machine can enable much faster model training. The Horovod library makes it fast and easy (minimal code changes) to do distributed deep learning in TensorFlow. New Nvidia GPUs support NVLink for inter-GPU communication, which offers 10x higher throughput than the PCIe bus.
  • Train the model with mixed precision. Nvidia's new Tensor Cores (Volta/Turing GPUs) offer hardware acceleration for mixed precision training. Precision lower than 32-bit floating point requires less memory and bandwidth, and math operations run faster in reduced precision. Up to a 3x speedup is possible with mixed precision training on the Volta/Turing architectures.
  • Simple and small algorithms can be used as building blocks for massively parallel algorithms. GPU primitives may be used to compose more complicated algorithms while retaining high performance, readability and reliability. Examples of such parallel primitives include reduction, prefix sum (scan) and sorting.
  • GPUs are optimized for 32-bit floating point operations rather than 64-bit double precision. Consider 32-bit parallel and sequential summation: parallel summation shows dramatically better numerical stability, because its error grows proportionally to O(log n), compared to O(n) for sequential summation.
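
Two of the guidelines above can be checked on the host with a few lines of C++ (no GPU required). The sketch below shows (a) how many bits a quantized symbol needs and how quantile bin edges can be derived so that each bin holds about the same number of values, and (b) why tree-shaped (pairwise) summation, the order used by a parallel reduction, loses less precision than sequential summation in 32-bit floats. This is not XGBoost's implementation: all function names are hypothetical, and the bit count is taken as the number of bits needed to represent the values 0 to max_value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Bits needed to store integer symbols in the range [0, max_value].
int bits_per_symbol(unsigned max_value) {
    int bits = 1;
    while (bits < 32 && (max_value >> bits) != 0) ++bits;
    return bits;
}

// Upper bin edges chosen so each bin holds roughly the same number
// of input values (quantile binning). Takes a copy so it can sort.
std::vector<float> quantile_edges(std::vector<float> values, int nbins) {
    assert(nbins > 0 && values.size() >= static_cast<std::size_t>(nbins));
    std::sort(values.begin(), values.end());
    std::vector<float> edges;
    for (int b = 1; b <= nbins; ++b) {
        std::size_t idx = values.size() * b / nbins - 1;
        edges.push_back(values[idx]);
    }
    return edges;
}

// Index of the first bin whose upper edge is >= x; values above the
// last edge are clamped into the last bin.
int bin_of(const std::vector<float>& edges, float x) {
    auto it = std::lower_bound(edges.begin(), edges.end(), x);
    if (it == edges.end()) return static_cast<int>(edges.size()) - 1;
    return static_cast<int>(it - edges.begin());
}

// Left-to-right summation: rounding error can grow with n.
float sequential_sum(const std::vector<float>& x) {
    float s = 0.0f;
    for (float v : x) s += v;
    return s;
}

// Tree-shaped summation: each value passes through about log2(n) adds.
float pairwise_sum(const float* x, std::size_t n) {
    if (n == 0) return 0.0f;
    if (n == 1) return x[0];
    std::size_t half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}
```

For 256 bins (max_value = 255) a symbol needs 8 bits, a 4x saving over a 32-bit float. Assuming IEEE-754 single-precision arithmetic, the input {16777216, 1, 1, 1, 1, 1, 1, 1} makes the difference in summation order visible: the sequential sum stays at 16777216 because every added 1 is rounded away, while the pairwise sum returns 16777222, much closer to the exact value 16777223.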

GPU and CPU Monitoring Differences

Processor utilization
CPU: Idle if no thread is running or ready to run on the CPU. Tool: cat /proc/stat
GPU: Idle if no kernel (CUDA function) is running on the GPU. The GPU allows multiple kernels to execute concurrently. The Timeline View in the Visual Profiler shows the status of all kernels and streams on the GPU. Tool: nvprof, NVML library
Processor cores utilization
CPU: Workload concurrency, that is, whether application threads/processes are using all vCPUs (cores). Tool: pidstat, top, ps
GPU: CUDA applications manage concurrency by running asynchronous commands in streams, where a stream is a sequence of commands that execute in order. In a multi-threaded program, each thread can launch kernels that will be executed concurrently. Kernel parameters such as grid and block sizes are also reported to show the level of concurrency used in the kernel. Tool: nvprof, nvvp
Thread states
CPU: Thread states can be running, sleeping (stalled) or blocked on I/O. Tool: ps, pidstat
GPU: Threads run in groups of 32, called warps, on an SM (multiprocessor). Threads can be active, inactive or stalled. nvprof can query metrics that show whether one or more threads are active or stalled. Tool: nvprof, nvvp
Instructions per cycle
CPU: Instructions per cycle. Tool: Linux perf stat
GPU: Instructions per cycle across the GPU; instructions issued and executed per cycle per SM and per warp. An SM can schedule 4 concurrent warps. Tool: nvprof
Cache usage
CPU: L1 and L2 cache usage: hit, miss and usage. Tool: Linux perf stat
GPU: Register and shared memory usage; L1 and L2 cache hit/miss and usage. Tool: nvprof, nvvp
Memory usage
CPU: System-level memory (RAM) usage. Tool: free, vmstat
GPU: Global memory usage: total, free, used. Tool: nvprof, NVML library
I/O throughput
CPU: Network and storage throughput. Tool: sar, iostat
GPU: No network or storage is attached to the GPU. Throughput and transaction counts are reported for the PCIe bus (reads/writes to system memory (RAM) via DMA), device global memory, L1/L2 caches and on-chip shared memory. Tool: nvprof, nvvp
Active processes and threads
CPU: Reports processes/threads running on the CPUs. Tool: top
GPU: Reports kernels (CUDA functions) running on the GPU. Tool: nvprof, nvvp
Active functions
CPU: Application or Linux kernel functions running on the CPUs. Tool: Linux perf top
GPU: For each kernel, one can list the CUDA functions executed on the device, the time spent in memcpy (device <-> host) and the time spent running CUDA functions on the GPU. Tool: nvprof, nvvp

Nvidia GPU Metrics

Some basic GPU utilization metrics can be fetched via Netflix Vector or Grafana, which are front ends to the open source PCP (Performance Co-Pilot) package:
Percent of time in last sample period, one or more kernels were executing on the GPU
Percent of time in last sample period device memory was being read and written
memory allocated on gpu
total global memory on gpu
free global memory on gpu
power states (P0 means full power) on gpu
number of physical gpu attached to system
gpu number 0-7
PCI-e bus id where gpu is connected
card name
gpu temperature
gpu fan speed

Nvidia GPU Profiling

One can profile functions (kernels) running on the GPU and isolate the cause of GPU bottlenecks and low utilization with Nvidia tools: nvprof (CLI) and Visual Profiler (GUI). In addition, the Nvidia C API called NVTX can be integrated into the program to annotate events, code ranges and application resources, so that CPU events and time ranges can be captured, traced and visualized (via Visual Profiler) and CUDA resources can be named.
The Nvidia profiling tool nvprof (similar to Linux perf for the host CPU) can capture profiling samples, events and various GPU metrics, as listed below.

Instructions executed per cycle
Instructions executed per cycle for a single SM
Average number of instructions executed by each warp
Instructions issued per cycle
Ratio of average active warps per active cycle to the maximum number of warps supported on an SM
Utilization level on a scale of 0 to 10 of the SM function units that execute integer and floating point arithmetic instructions
Ratio of average active threads per warp to the maximum number of threads per warp supported on an SM, reported as a percentage
Ratio of non-divergent branches to total branches, expressed as a percentage
GPU device memory read throughput and read transactions
GPU device memory write throughput and write transactions
Device memory utilization on a scale of 0 to 10 relative to the peak utilization
A number of metrics for single and double precision floating point operation stats: add, multiply-accumulate, multiply, etc.
Hit rate in L1 cache for global loads
Hit rate in L1 cache for local loads and stores
Utilization level on a scale of 0 to 10 for L1/shared memory relative to peak utilization
Memory read throughput and transactions seen at L2 cache for all read requests
Memory write throughput and transactions seen at L2 cache for all write requests
Number of issued and executed load and store instructions
Global memory load throughput and transactions
Local memory store throughput and transactions
Shared memory load and store transactions
Utilization of system memory on a scale of 0 to 10 relative to peak
System memory read throughput and transactions
System memory write throughput and transactions
Percentage of stalls occurring because the warp is blocked at a __syncthreads() call

Event: A countable activity, action or occurrence on a device. It corresponds to a single hardware counter value collected while a kernel executes. To list events: $ nvprof --query-events
Metric: A value calculated from one or more event values to represent some GPU characteristic. To list metrics: $ nvprof --query-metrics

Capture particular metrics: $ nvprof --metrics achieved_occupancy,executed_ipc -o nvprof.out <application_name>

For performance analysis of a specific kernel running on the GPU, type:
 $ nvprof --kernels <kernel-name> --analysis-metrics -o analysis.perf <application_name>
It will produce an analysis.perf file that can be viewed using the Nvidia Visual Profiler.

Caution: The nvprof metrics option may negatively affect the performance characteristics of functions running on the GPU, as it may cause all kernel executions to be serialized on the GPU.

CUDA Environment Setup

A machine learning pipeline is composed of many stages: data ingestion, exploration, feature generation, data cleansing, model training, validation, and lastly publishing and collecting performance metrics. A CI/CD pipeline can automate the process of retraining models and deploying them in production as microservices.
Compiler to build C programs that run on the CPU
$ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Linux headers
$ sudo apt-get install linux-headers-$(uname -r)
CUDA build environment and tools
$ sudo apt-get install cuda
Installs the Nvidia driver and the CUDA binaries and libraries.
This will also build the kernel and initrd files in /boot and update GRUB.
$ sudo apt-get install nvidia-cuda-toolkit
Latest nvidia drivers:
$ wget
$ sudo ./
Sets env to find cuda binaries and libraries
Update PATH and LD_LIBRARY_PATH environment variables in /etc/environment file


LD_LIBRARY_PATH=/usr/local/cuda-8.0/lib /mnt
Installs cuda sample programs
Run the command below to install CUDA sample programs in /mnt directory.
$ /usr/local/cuda-8.0/bin/ /mnt
After successful install, you should see the following folders created in directory: /mnt/NVIDIA_CUDA-8.0_Samples
$ ls
0_Simple  1_Utilities  2_Graphics 3_Imaging  4_Finance 5_Simulations  6_Advanced 7_CUDALibraries  common EULA.txt Makefile
Check that the Nvidia device driver is loaded and device nodes are created
To confirm the Nvidia drivers are loaded, run:
$ nvidia-smi

$ lsmod | grep nvidia
$ cat /proc/driver/nvidia/version

List all GPU devices attached to the PCIe bus and the GPU to Nvidia driver binding
$ lspci -v

Files used by CUDA to communicate with the kernel-mode portion of the NVIDIA Driver.
$ ls -lt /dev/nv*
crw-rw-rw- 1 root root 245, 0 Jul 18 23:05 /dev/nvidia-uvm
crw-rw-rw- 1 root root 195, 0 Jul 18 23:05 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jul 18 23:05 /dev/nvidiactl

NOTE: Set the Nvidia GPU to run at a higher clock all the time by disabling
the auto-boost feature. For Nvidia Tesla T4 run:
$ sudo nvidia-smi -ac 5001,1590
Verify install by querying GPU device
Binaries are placed in the /mnt/NVIDIA_CUDA-8.0_Samples/bin directory
$ cd /mnt/NVIDIA_CUDA-8.0_Samples
$ make
$ cd /mnt/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release
$ ./deviceQuery

Nvidia Validation Suite:
$ nvvs -g
$ nvvs -c Amazon_EC2_p3.16xlarge.conf -d 5 -l debug.log
$ nvvs -c Tesla_K40c_quick.conf

Cuda repos for Ubuntu Xenial:

The Nvidia Validation Suite consists of a series of plugins:
Deployment plugin: Verifies that the compute environment is ready to run CUDA applications and is able to load the NVML library
GPU Bandwidth plugin: Measures bandwidth and latency to and from the GPUs and the host
SM Performance plugin: Brings the GPU to a target performance level in gigaflops by doing large matrix multiplications

GPU Sample Program

The sample program computes the cube of the first 96 non-negative integers (0 to 95)
#include <stdio.h>

// kernel: function that runs on the GPU, one thread per array element
__global__ void cube(float * d_out, float * d_in) {
    int idx = threadIdx.x;
    float f = d_in[idx];
    d_out[idx] = f * f * f;
}

int main(int argc, char ** argv) {
    const int ARRAY_SIZE = 96;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    // initialize the input array on the host
    float h_in[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) {
        h_in[i] = float(i);
    }
    float h_out[ARRAY_SIZE];

    // declare GPU memory pointers
    float * d_in;
    float * d_out;

    // allocate GPU memory
    cudaMalloc((void**) &d_in, ARRAY_BYTES);
    cudaMalloc((void**) &d_out, ARRAY_BYTES);

    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // launch the kernel: 1 block of ARRAY_SIZE threads
    cube<<<1, ARRAY_SIZE>>>(d_out, d_in);

    // copy back result array to the CPU
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    // print out the resulting array, four values per row
    for (int i = 0; i < ARRAY_SIZE; i++) {
        printf("%f", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }

    // release GPU memory
    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}

Compile and run (the first 12 of 24 output rows are shown):
$ nvcc -o cube <source_file>.cu
$ ./cube
0.000000 1.000000 8.000000 27.000000
64.000000 125.000000 216.000000 343.000000
512.000000 729.000000 1000.000000 1331.000000
1728.000000 2197.000000 2744.000000 3375.000000
4096.000000 4913.000000 5832.000000 6859.000000
8000.000000 9261.000000 10648.000000 12167.000000
13824.000000 15625.000000 17576.000000 19683.000000
21952.000000 24389.000000 27000.000000 29791.000000
32768.000000 35937.000000 39304.000000 42875.000000
46656.000000 50653.000000 54872.000000 59319.000000
64000.000000 68921.000000 74088.000000 79507.000000
85184.000000 91125.000000 97336.000000 103823.000000

Nvidia GPU Feature Comparison and Benchmarks

Compared to a CPU, a GPU has much higher single- and double-precision floating-point compute power and memory bandwidth.

  • Nvidia Training and Inference Benchmarks: MLPerf, ResNet-50 using: TensorFlow, MXNet, PyTorch 

Tensor Cores - Nvidia New Architecture (Volta/Turing GPUs)

Nvidia's Volta and Turing architectures introduced Tensor Cores, specialized execution units designed specifically for accelerating tensor (matrix) operations, the core compute functions used in deep learning training and inference.

For graphics acceleration, Tensor Cores power a technique called Deep Learning Super Sampling (DLSS). DLSS leverages a DNN to extract multi-dimensional features of the rendered scene and intelligently combine details from multiple frames to construct a high-quality final image. It does this with fewer input samples than traditional techniques, while avoiding the algorithmic difficulties those techniques face with transparency and other complex scene elements.

The Turing SM provides independent floating-point and integer data paths that run in parallel. In previous generations, integer instructions would have blocked floating-point instructions from issuing. This feature accelerates deep learning inference applications that mix computation with address calculation.

One of the big advantages of Deep Learning (DL) is that a model can be trained at high precision and deployed at lower precision without sacrificing much accuracy. The T4 is priced lower than Volta GPUs. Even though the T4 is slower than Volta GPUs in DL training benchmarks (ResNet-50 image classification on CNNs, MLPerf), it is on par with, and sometimes better than, Volta in DL-powered inference due to its enhanced DL capabilities.

Tensor Cores are purpose-built to accelerate multi-precision inference. The new Streaming Multiprocessor (SM) with Turing Tensor Cores provides FP16/FP32 mixed-precision matrix math. There are also new INT8/INT4 precision modes that support fast integer matrix operations to improve throughput with minimal loss in accuracy. The ability to do reduced-precision inference can significantly lower application latency while preserving model accuracy, as required for embedded applications.
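
To give a feel for why INT8 inference can keep accuracy loss small, here is a host-side C++ sketch (no GPU required) of symmetric linear quantization, one common way to map 32-bit floats to 8-bit integers. The scheme and the function names are assumptions chosen for illustration; the text above does not say which quantization method any particular framework or Tensor Core library uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Scale that maps the largest magnitude in x to the INT8 value 127
// (symmetric scheme: zero stays at zero).
float int8_scale(const std::vector<float>& x) {
    float max_abs = 0.0f;
    for (float v : x) max_abs = std::max(max_abs, std::fabs(v));
    return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

// Round to the nearest step and clamp into [-127, 127].
std::int8_t quantize(float v, float scale) {
    long q = std::lround(v / scale);
    q = std::max(-127L, std::min(127L, q));
    return static_cast<std::int8_t>(q);
}

// Map the 8-bit value back to an approximate float.
float dequantize(std::int8_t q, float scale) {
    return static_cast<float>(q) * scale;
}
```

For values inside the calibrated range, the round trip through INT8 changes each value by at most about half a quantization step (scale / 2). That bounded error is the reason a model trained in FP32 can often be served in INT8 with little accuracy loss, though the actual impact depends on the model.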

T4 Turing offers unified shared memory and L1 cache, which simplifies programming and boosts performance. It also supports address space isolation and independent thread scheduling, which enable finer-grained synchronization and cooperation.