Thursday, September 8, 2016

Exploring NUMA on Amazon Cloud Instances

What is NUMA
NUMA NODES
Cache Coherent NUMA (ccNUMA)
Server Memory Hierarchy
FileSystem Cache Latency
Linux NUMA Policies
Application Design
NUMA Monitoring
NUMA TESTING

What is NUMA

In Symmetric Multiprocessor (SMP) systems, all physical memory is seen as a single pool: all cpus share hardware resources and the same address space, under a single instance of the kernel. When physical memory and IO devices are equidistant, in terms of latency, from the set of independent physical cpus (sockets), the system is called UMA (Uniform Memory Access). In a UMA configuration, all physical cpus access memory through the same memory controller and share the same bus. A system configured with a single cpu socket (a socket may have multiple logical cores, each with two hyperthreads) is UMA. Building a large SMP server is difficult due to the physical limitations of a shared bus and the higher bus contention that comes with more cpus. NUMA (Non Uniform Memory Access) architecture allows much bigger system configurations, but at the cost of varying memory latencies. A system designed with multiple cpu sockets is NUMA. Amazon 8xlarge instances (r3.8xl, i2.8xl, c3.8xl, m4.10xl, x1.32xl ..) and above are all NUMA servers. NUMA is a design trade-off, since it introduces higher memory latencies. The primary benefit of NUMA servers is unmatched throughput in all resource dimensions (compute, memory, network, storage), which is not possible with UMA SMP servers. The Linux kernel is NUMA aware and takes processor affinity and data locality into account to keep memory latency low, where possible. The Linux NUMA API (libnuma) and the "numactl" command allow applications to offer hints to the kernel on how their memory should be managed.
[Figure: UMA topology vs. NUMA topology]

NUMA Nodes

Typically, NUMA systems are made up of a number of nodes, each with its own cpus, memory and IO devices. Nodes are connected via a high speed interconnect, for example Intel QPI (Quick Path Interconnect). Each node uses the interconnect to access memory and IO devices attached to remote cpus. Each NUMA node acts as a UMA SMP system with fast access to local memory and IO devices but relatively slower access to memory on remote nodes. By giving each node its own local memory, NUMA reduces the contention associated with the shared memory bus found in UMA servers and thus lets systems achieve higher aggregate memory bandwidth. In general, the cost of accessing memory increases with the distance from the cpu. Thus, the more data fails to be local to the node that accesses it, the more a memory intensive workload will suffer from the architecture.
How Linux knows about System Topology:
ACPI (Advanced Configuration and Power Interface) in the BIOS builds a System Resource Allocation Table (SRAT) that associates each cpu and block of memory with a proximity domain, called a numa node. SRAT describes the physical configuration and cpu/memory architecture, that is, which cpus and memory ranges belong to each NUMA node; the kernel uses this information to map the memory of each node into a single sequential block of the memory address space. The Linux kernel uses SRAT to understand which memory bank is local to a physical cpu and attempts to allocate local memory for each cpu. The System Locality Information Table (SLIT), also built by ACPI, provides a matrix that describes the relative distance (i.e. memory latency) between these proximity domains. In the past, more nodes meant overall higher latencies due to the distance between nodes. Modern point-to-point architectures moved away from a ring topology to a full mesh topology, keeping hop counts fixed no matter how large the numa configuration.
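The same topology information derived from SRAT/SLIT can be queried programmatically with libnuma. The short sketch below is illustrative (not from the original post); it assumes the libnuma development package is installed and prints the node count and relative distance matrix, similar to the "node distances" table shown by "numactl --hardware" later in this article.

/* numa_topology.c - print NUMA node count and SLIT-style distance matrix via libnuma.
 * Illustrative sketch; build with: gcc -O2 numa_topology.c -o numa_topology -lnuma */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {            /* returns -1 on kernels without NUMA support */
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int max_node = numa_max_node();        /* highest node number, e.g. 3 on x1.32xl */
    printf("configured nodes: %d\n", max_node + 1);

    /* Relative distances as exposed by ACPI SLIT (10 = local node). */
    for (int i = 0; i <= max_node; i++) {
        for (int j = 0; j <= max_node; j++)
            printf("%4d", numa_distance(i, j));
        printf("\n");
    }
    return 0;
}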

Cache Coherent NUMA (ccNUMA)

Since each cpu core (a physical cpu is made up of multiple logical cores) maintains its own set of caches, multiple copies of the same data can introduce cache-coherency issues. Cache coherency means any variable used by a cpu must have a consistent value: a load instruction (read of a memory location) must retrieve the last store (write to that memory location). Without cache coherency, data modified in one cpu cache may not be visible to other cpus, which may result in data corruption.
SMP requires the platform to support cache coherency in hardware. This is achieved by a hardware protocol called cache snooping. Snooping guarantees that all processors have the same view of the current state of data or locks in physical memory. Cache snooping monitors cpu caches for modified data. Cpu caches are divided into equal size storage slots, called cache lines. Data is fetched from physical memory in 64 byte chunks and placed into cache lines for quick access. When one cpu modifies a cache line, all other cpu caches are checked for the same cache line. If found, the cache line is invalidated or discarded from those cpu caches. When a cpu tries to access data not found in its cache, cache snooping first checks the cache lines of other cpus. If found, the data is served from the other cpu's cache instead of being fetched from memory. Thus the write-invalidate snooping feature erases all copies of the data in all cpu caches before writing to the local cpu cache. This results in a cache miss for the invalidated cache line in the other cpus' local caches; the data is then served from the cache of the cpu holding the most recently modified copy.
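A practical consequence of 64 byte cache lines and write-invalidate snooping is false sharing: two threads updating different variables that happen to sit on the same cache line keep invalidating each other's copy. Below is a minimal illustrative sketch (not from the original post) of padding per-thread counters to cache-line size so they land on separate lines.

/* false_sharing.c - illustrative sketch of cache-line padding (C11).
 * Two per-thread counters; aligning each to 64 bytes keeps them on separate
 * cache lines so snooping does not ping-pong the line between cores.
 * Build: gcc -O2 -pthread false_sharing.c -o false_sharing */
#include <pthread.h>
#include <stdio.h>
#include <stdalign.h>

#define ITERS 100000000UL

struct padded_counter {
    alignas(64) unsigned long value;   /* one counter per 64-byte cache line */
};

static struct padded_counter counters[2];

static void *worker(void *arg)
{
    struct padded_counter *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        c->value++;                    /* no cache-line contention with the other thread */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%lu %lu\n", counters[0].value, counters[1].value);
    return 0;
}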

Server Memory Hierarchy

Each cpu has a hierarchy of caches (L1, L2, L3) to reduce the latency of fetching data from physical memory. Physical memory access latency is lower on the local numa node than on a remote node. Cache latency is measured in cycles, and physical memory latency in nanoseconds. Approximate costs:
L1 (cycles)                                        4
L2 (cycles)                                        10
L3, cache line unshared (cycles)                   40
L3, cache line modified in another core (cycles)   65
L3, cache line shared in another core (cycles)     75
L3, cache line in remote socket (cycles)           100-300
Local memory (ns)                                  60
Remote memory (ns)                                 100
If a virtual to physical memory translation (v->p) is not cached in the DTLBs (Data Translation Lookaside Buffers), it takes 1-2 lookups to find the page table entry containing the v->p translation, which means an additional 1 or 2 DRAM latencies (60-120 ns) before data can be fetched from physical memory. Also, the DRAM latencies quoted above are for the memory access alone; true DRAM latency should also include cache miss latency. When an application thread experiences a cache miss, the DRAM request goes into a 16-entry deep queue, so the latency also depends on how many other memory requests are still pending. A worst case request could have a latency of up to (3+16) * DRAM latency.

To improve hit ratio, one should:
  • Reduce TLB misses by using large pages (2 MB, 1 GB) or by making memory accesses occur at adjacent memory addresses (see the huge page sketch after this list).
  • Use temporary variables that can be kept in, or optimized into, cpu registers.
  • Coordinate thread/process activities by running them close to each other (on the same core or socket). This allows threads to share caches.
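A minimal illustrative sketch of the large-page idea from the first bullet, explicitly requesting 2 MB huge pages with mmap(MAP_HUGETLB). It assumes huge pages have already been reserved (for example via /proc/sys/vm/nr_hugepages); transparent huge pages or database tunables are the more common way to get the same effect.

/* hugepage_alloc.c - illustrative sketch: back a buffer with 2 MB huge pages
 * to cut down on DTLB misses. Assumes the kernel has huge pages reserved.
 * Build: gcc -O2 hugepage_alloc.c -o hugepage_alloc */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE (64UL * 1024 * 1024)   /* 64 MB, a multiple of the 2 MB huge page size */

int main(void)
{
    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");    /* fails if no huge pages are reserved */
        return 1;
    }
    memset(buf, 0, BUF_SIZE);           /* touch pages; one TLB entry now covers 2 MB */
    munmap(buf, BUF_SIZE);
    return 0;
}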

FileSystem Cache Latency

When an application requests memory, Linux uses the default memory allocation policy and attempts to allocate from the local numa node. To keep the latency of application file system reads and writes low, page cache memory is also allocated from the local numa node. At some point, the local numa node may fill up with application and file system pages. When an application running on that node requests more memory, the kernel has two options:
  1. Free filesystem page cache memory on the local node, considering that page cache memory is counted as free memory.
  2. Allocate application memory from a remote node.
Linux's decision is influenced by the kernel tunable vm.zone_reclaim_mode (default: 0). The default behavior is to keep the filesystem page cache intact on the local node and satisfy the application memory request from a remote node when no free memory is available locally. That keeps filesystem page cache (read/write) latency low, but the application may see higher memory latency due to remote memory access. Setting vm.zone_reclaim_mode=1 enables aggressive page reclamation, evicting filesystem pages from the local node in favor of application memory pages. This may lower application memory latency, but at the cost of higher file system cache latency.
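The current setting can be checked with "sysctl vm.zone_reclaim_mode" or by reading /proc/sys/vm/zone_reclaim_mode. A tiny illustrative C sketch of the same check (the printed interpretation simply restates the default vs. reclaim behavior described above):

/* check_zone_reclaim.c - illustrative sketch: read vm.zone_reclaim_mode from procfs.
 * Build: gcc -O2 check_zone_reclaim.c -o check_zone_reclaim */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "r");
    if (!f) {
        perror("zone_reclaim_mode");
        return 1;
    }
    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1) {
        fclose(f);
        return 1;
    }
    fclose(f);
    if (mode == 0)
        printf("0: keep local page cache, allocate application memory from remote nodes when local is full\n");
    else
        printf("%d: reclaim local page cache before falling back to remote nodes\n", mode);
    return 0;
}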

Linux NUMA POLICIES

Processor affinity and data placement play an important role in application performance. Processor affinity refers to associating a thread/process (task) with a particular processor. Linux, by default, tries to keep a process on the same core to take advantage of a warm cpu cache. A task is considered "cache hot" if it wakes up within half a millisecond (kernel tunable: sched_migration_cost_ns); otherwise it is a candidate for migration to other cores in the physical socket. If all cores in the socket are busy, the task may get pushed to a core in a remote socket. Running the task on another socket may add memory access latency due to remote memory access. Linux offers various tools and APIs to override the default scheduling behavior and memory allocation policies by binding a task to a subset of cpu cores and by allocating memory from a particular numa node.
Linux "taskset" can be used to bind the task to a particular cpu or core. Taskset is, however, not numa aware. Setting processor affinity manually may affect application performance negatively due to cpu contention (more application runnable threads than available cpus) and scheduler restriction not to assign waiting threads to idle or underutilized cores. Also memory access time may also increase when additional allocation cannot be satisfied from the local node. Old memory allocation may not be automatically migrated unless Linux kernel tunable "numa_balancing" is enabled, that map/unmap application pages in an attempt to keep application memory local, if possible. Linux also offers utilities, migratepages, and API routines to move application memory pages manually from one node to another. . See sample program.
Determine processor affinity of a task
$ taskset -p <pid>
Start an application with a cpu affinity mask. This pins the task to the set of cpus specified by the mask.
$ taskset <cpu mask> <application>
One can also specify a cpu number instead. This pins the application to cpu 8.
$ taskset -c 8 <application>
Change the affinity mask of a running task. The process will then be restricted to run on the cpus specified by the mask.
$ taskset -p <cpu mask> <pid>
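The same binding can be done from within a program. A minimal illustrative sketch (not from the original post) using sched_setaffinity(2), the programmatic equivalent of "taskset -c 8":

/* pin_to_cpu.c - illustrative sketch: pin the calling process to cpu 8.
 * Build: gcc -O2 pin_to_cpu.c -o pin_to_cpu */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(8, &set);                                     /* allow only cpu 8 */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = current process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("now running on cpu %d\n", sched_getcpu());
    return 0;
}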


Linux "numactl" is numa aware that allows task placement and memory allocation to a particular numa node. numactl allows system admin to change the default memory allocation policy and specify the one that matches application requirements:
  • Bind:  Memory allocation should come from the supplied numa nodes. 
  • Preferred: Memory allocation preference by specifying list of numa nodes. If first node is full, memory is allocated from the next node
  • Interleave: Memory allocation is interleaved among a set of specified numa nodes. This is useful when amount of memory allocation cannot fit into a single numa node and application is multithreaded/multiprocess. This mode allows memory allocation across multiple numa nodes and balanced memory latency across all application threads.

Run myapp on cpus 2,4,6,8 and allocate memory only to local memory where the process runs.
$numactl -l --physcpubind=2,4,6,8 myapp

Run multithreaded application myapp with its memory interleaved across all numa nodes to achieve balanced latency across all application threads.
$numactl --interleave=all myapp

Run process on cpus that are part of node 0 with memory allocated on node 0 and 1.
$numactl --cpunodebind=0 --membind=0,1 myapp
Linux Control groups (cgroups): cgroups offer a more granular approach to limiting resource usage of a single-process or multi-process application. cgroups control application resource usage through resource controllers, which are kernel drivers/modules that offer fine-grained control over system resources and thus allow the resources of a large server to be logically partitioned. This lets multiple competing workloads share a common kernel, with each workload confined to a subset of system resources (cpu, memory, io, network). For example, the "cpuset" resource controller can partition numa nodes among workloads and give each workload dedicated use of the cpus and memory of a numa node.

Application Design

NUMA should be thought into application design, as NUMA servers are becoming commonplace due to economies of scale and the need to control server sprawl in data centers. Even docker containers are typically hosted on large numa servers, since large servers allow packing hundreds or even thousands of containers on a single host. Large servers also have better RAS (Reliability, Availability, Serviceability) and can be configured for higher availability. A few numa considerations are listed below:
  • Reduce TLB misses by using the Linux HugePages feature, which allocates large pages (2 MB, 1 GB) instead of 4 KB pages. Oracle, MySQL and others offer tunables to use large pages.
  • Oracle and MySQL support a "numa" feature, and it should be enabled when hosted on NUMA servers.
  • If not using large pages, memory accesses should occur at adjacent memory addresses to trigger the Intel cpu hardware prefetch logic.
  • Use temporary variables that can be kept in, or optimized into, cpu registers.
  • Coordinate thread/process activities by running them close to each other (on the same core or socket). This allows threads to share caches.
  • An application can also bind threads/processes to a cpu or group of cpus (cpu sets) to keep them close to their memory, if desired. Linux has tools such as taskset and cgroups to set thread/process affinity; cgroups offer better control over which numa nodes a process can run on and/or allocate memory from.
  • For applications using a master/slave model where slave threads work on independent sets of data, the programmer should ensure that memory allocations are made by the thread that will subsequently access the data, not by initialization code. The Linux kernel places memory pages local to the allocating thread, and thus local to the worker threads that will access the data (see the first-touch sketch after this list).
  • For a multi-threaded or multi-process application sharing a common but very large pool of data (mysql shared buffer cache, oracle SGA) that may not fit into a single node's memory, it is recommended to use the Linux "--interleave=all" memory mode (set by numactl or libnuma) to spread the memory across multiple numa nodes and achieve balanced memory latency across all application threads or processes.
  • When the application's memory allocation fits into a single memory node ("numactl --hardware" shows the memory configured in each node) and multiple threads/processes access the same data, better performance is achieved by co-locating the threads on the same node. This can be done by selecting the Linux "bind" policy ("--cpunodebind=0 --membind=0") via the "numactl" command or the NUMA API.
  • Some long-lived, memory intensive applications regularly create and destroy threads in a thread pool to deal with changes in load. Since newly created threads may not be local to the data, this can result in higher latencies when accessing shared data. One should monitor memory access latencies periodically and, if remote memory access is observed, migrate pages close to the threads. In general, migrating memory pages from one node to another is an expensive operation, but it may be worthwhile for applications doing memory intensive operations. To migrate application pages manually from one node to another, consider using the "migratepages" command or the numa API. Automated system level balancing (the "numa_balancing" feature) is also supported by Linux, but it may result in higher kernel cpu overhead depending on the workload. A better option is to move pages manually using the "migratepages" command or the numa API, as demonstrated in this program.
  • Applications that want Linux to enforce the chosen policy (bind, preferred, interleave) early, at allocation time, should use the "numactl --touch" option. The default is to apply the policy at page fault time, when the application first accesses a page.
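A minimal illustrative sketch of the first-touch pattern from the master/slave bullet above (not from the original post): each worker thread allocates and touches its own buffer, so under the default policy the kernel places those pages on the worker's local node rather than on the node where the initialization code ran.

/* first_touch.c - illustrative sketch: each worker allocates and touches its
 * own data so pages land on the worker's local numa node (first-touch).
 * Build: gcc -O2 -pthread first_touch.c -o first_touch */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NWORKERS 4
#define CHUNK (256UL * 1024 * 1024)    /* 256 MB per worker */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* Allocate inside the worker, not in main(): pages are faulted in here,
     * so the kernel's default policy places them on this thread's node. */
    char *data = malloc(CHUNK);
    if (!data)
        return NULL;
    memset(data, (int)id, CHUNK);      /* first touch = page placement */
    /* ... process data locally ... */
    free(data);
    return NULL;
}

int main(void)
{
    pthread_t t[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NWORKERS; i++)
        pthread_join(t[i], NULL);
    return 0;
}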

NUMA MONITORING

"numactl" can be used to find number of numa nodes in the system, distance between nodes, cpu and memory association with the node, and Linux policy (default, bind, preferred, interleave) of current process. "numastat" can be used to check per numa node memory allocation statistics. Also process level numa node memory allocation statistics can also be displayed. It is similar information reported in /proc/<pid>/numa_map
For example, Amazon instance x1.32xlarge has four numa nodes. Each with 32 vcpu and 480 MB of memory. Distance between nodes is the same due to mesh topology
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 491822 MB
node 0 free: 484600 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 1 size: 491900 MB
node 1 free: 488717 MB
node 2 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 2 size: 491900 MB
node 2 free: 488907 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 3 size: 491899 MB
node 3 free: 488951 MB
node distances:
node   0   1   2   3
 0:  10  20  20  20
 1:  20  10  20  20
 2:  20  20  10  20
 3:  20  20  20  10
------
Current shell or process memory allocation policy, cpu and memory binding can be listed using:
$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
cpubind: 0 1 2 3
nodebind: 0 1 2 3
membind: 0 1 2 3


numa hit (local memory) and miss (remote memory) statistics
$numastat
                        node0        node1        node2        node3
numa_hit                37703880        43310766        45362620        40991721
numa_miss                      0               0               0               0
numa_foreign                   0               0               0               0
interleave_hit            208782          209249          208783          209259
local_node              35355762        40684734        42747490        38274373
other_node               2348118         2626032         2615130         2717348
Where:
- numa_hit is the number of allocations that were intended for this node and succeeded here.
- numa_miss is the number of allocations that were intended for another node but ended up on this node due to memory constraints on the intended node.
- numa_foreign is the number of allocations that were intended for this node but ended up on another node. Each numa_foreign event has a corresponding numa_miss on the other node.
- interleave_hit is the count of interleave-policy allocations that were intended for this node and succeeded here.
- local_node is incremented when a process running on this node allocated memory on the same node.
- other_node is incremented when a process running on another node allocated memory on this node.

Per-process numa specific statistics:
$ numastat -p $$
Per-node process memory usage (in MBs) for PID 22028 (bash)
                          Node 0          Node 1          Node 2
                 --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         2.01            0.04            0.09
Stack                        0.02            0.01            0.00
Private                      1.68            0.16            0.04
----------------  --------------- --------------- ---------------
Total                        3.71            0.20            0.13
                          Node 3           Total
                 --------------- ---------------
Huge                         0.00            0.00
Heap                         0.03            2.16
Stack                        0.00            0.03
Private                      0.00            1.89
----------------  --------------- ---------------
Total                        0.03            4.07
 
/proc/<pid>/numa_maps shows information about a process's memory areas and the numa nodes from which they are allocated.
 
$cat /proc/61623/numa_maps
00400000 default file=/bin/bash mapped=182 mapmax=6 active=158 N0=178 N2=4
006ef000 default file=/bin/bash anon=1 dirty=1 N0=1
006f0000 default file=/bin/bash anon=9 dirty=9 N0=3 N2=2 N3=4
006f9000 default anon=6 dirty=6 N0=3 N2=3
026d1000 default heap anon=553 dirty=553 N0=10 N1=6 N2=270 N3=267
7f1f3b9bd000 default file=/lib/x86_64-linux-gnu/libnss_files-2.19.so mapped=3 mapmax=29 N0=3
7f1f3b9c7000 default file=/lib/x86_64-linux-gnu/libnss_files-2.19.so
..
Where:
- N<node>=<pages>: pages mapped by this process on each numa node. Example: N0=178, N2=4 means 178 pages are allocated on node 0 and 4 pages on node 2.
- file: backing store for the memory area. If pages were generated by copy-on-write (COW), there will also be anon pages.
- heap: process heap area
- stack: process stack area
- huge: huge pages mapped in the process area
- anon<pages>: anonymous pages
- dirty<pages>: modified (dirty) pages
- mapped<pages>: pages mapped from the backing store. Only shown if different from the count of anon and dirty pages.
- mapmax<count>: number of processes mapping this page. A sign that the memory area is shared.
- swapcache<count>: number of pages that have an associated entry on a swap device
- active<pages>: number of pages on the active list. If this count is shown, there are also inactive pages in the memory area that may be removed by the swapper soon.
- writeback<pages>: number of pages currently being written out

numa API:
  • getcpu: determine the cpu and NUMA node the calling thread is running on
  • get_mempolicy: retrieve the numa memory policy of a process, or the node that contains a given address
  • set_mempolicy: set the numa memory policy of a process (default, bind, preferred, interleave)
  • move_pages: move individual pages of a process to another node
  • mbind: set the memory policy (e.g. bind mode) for a particular memory range or node
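A minimal illustrative sketch tying a few of these calls together (not from the original post; it uses sched_getcpu() and libnuma's numa_node_of_cpu() in place of the raw getcpu() call, and assumes the libnuma development package is installed):

/* mempolicy_demo.c - illustrative sketch of the low-level numa API: report where
 * the thread runs, bind one memory range to node 0 with mbind(), then query it back.
 * Build: gcc -O2 mempolicy_demo.c -o mempolicy_demo -lnuma */
#define _GNU_SOURCE
#include <numa.h>       /* numa_available, numa_node_of_cpu */
#include <numaif.h>     /* mbind, get_mempolicy, MPOL_* */
#include <sched.h>      /* sched_getcpu */
#include <stdio.h>
#include <sys/mman.h>

#define LEN (16UL * 1024 * 1024)

int main(void)
{
    if (numa_available() < 0)
        return 1;

    int cpu = sched_getcpu();
    printf("running on cpu %d, numa node %d\n", cpu, numa_node_of_cpu(cpu));

    void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Bind this range to node 0; pages are placed there on first touch. */
    unsigned long nodemask = 1UL << 0;                 /* one bit per node: node 0 only */
    if (mbind(buf, LEN, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
        perror("mbind");

    /* Query the policy back for this address range. */
    int mode = -1;
    if (get_mempolicy(&mode, NULL, 0, buf, MPOL_F_ADDR) == 0)
        printf("policy for range: %s\n", mode == MPOL_BIND ? "MPOL_BIND" : "other");

    return 0;
}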

Intel PMU (Performance Monitoring Unit): 
Each core in an Intel processor has its own PMU (Performance Monitoring Unit) that provides a wealth of statistics about cache and memory latency and throughput. Linux perf can be used to capture events like cpu-clock, cycles, and task-clock (wall clock time). CPI (Cycles per Instruction) or IPC (Instructions per Cycle) can be used to estimate physical memory fetch latencies. The Intel PCM library and tools (pcm.x, pcm-numa.x, pcm-memory.x) provide additional information about off-core (uncore) statistics such as shared L3 cache stats, memory channel and QPI utilization and throughput, and numa memory latency and bandwidth. The uncore has its own PMU for monitoring these activities.
Linux perf:
perf stat -a -A
perf stat -e cpu-clock,cycles -p <pid>

Intel PCM tools sample output
Sample output of pcm-numa.x
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses
  0   1.06        359 G      340 G      3096 M              1151 M             
  1   1.06        359 G      339 G      3589 M               653 M             
  2   1.06        359 G      339 G      3459 M               787 M             
  3   1.06        360 G      339 G      3602 M               652 M             
  4   1.06        359 G      338 G      3435 M               808 M             
  5   1.06        359 G      339 G      3629 M               614 M             
  6   1.05        359 G      341 G      3080 M              1164 M             
  7   1.06        360 G      339 G      3132 M              1117 M             
  8   1.05        359 G      341 G      3237 M              1007 M             
  9   1.05        360 G      342 G      3565 M               693 M             
 10   1.05        359 G      341 G      3585 M               660 M   
....
Sample output of pcm-memory.x
---------------------------------------||---------------------------------------
--          Socket 0              --||--          Socket 1              --
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
--   Memory Performance Monitoring   --||--   Memory Performance Monitoring   --
---------------------------------------||---------------------------------------
--  Mem Ch 0: Reads (MB/s): 6870.81  --||--  Mem Ch 0: Reads (MB/s): 7406.36  --
--         Writes(MB/s): 1805.03  --||--         Writes(MB/s): 1951.25  --
--  Mem Ch 1: Reads (MB/s): 6873.91  --||--  Mem Ch 1: Reads (MB/s): 7411.11  --
--         Writes(MB/s): 1810.86  --||--         Writes(MB/s): 1957.73  --
--  Mem Ch 2: Reads (MB/s): 6866.77  --||--  Mem Ch 2: Reads (MB/s): 7403.39  --
--         Writes(MB/s): 1804.38  --||--         Writes(MB/s): 1951.42  --
--  Mem Ch 3: Reads (MB/s): 6867.47  --||--  Mem Ch 3: Reads (MB/s): 7403.66  --
--         Writes(MB/s): 1805.53  --||--         Writes(MB/s): 1950.95  --
-- NODE0 Mem Read (MB/s):  27478.96  --||-- NODE1 Mem Read (MB/s):  29624.51  --
-- NODE0 Mem Write (MB/s):  7225.79  --||-- NODE1 Mem Write (MB/s):  7811.36  --
-- NODE0 P. Write (T/s) :    214810  --||-- NODE1 P. Write (T/s):     238294  --
-- NODE0 Memory (MB/s):    34704.75  --||-- NODE1 Memory (MB/s):    37435.87  --
---------------------------------------||---------------------------------------
--                System Read Throughput(MB/s):  57103.47                  --
--               System Write Throughput(MB/s):  15037.15                  --
--              System Memory Throughput(MB/s):  72140.62                  --
---------------------------------------||---------------------------------------

NUMA TESTING

For memory access, two factors determine application performance: latency and bandwidth. Latency is the time required for the application to fetch data from the processor's cache hierarchy (L1-L3) or from the physical memory located on local or remote numa nodes. Besides latency, memory controller bandwidth also plays an important role in how fast data can be fed to the cpu. Measuring memory latency and throughput is therefore important to establish a baseline for the system under test.

Amazon Numa Servers

Instance    L1    L2     L3    node0              node1              node2              node3
x1.32xl     32K   256K   45M   cpu:32 mem:480G    cpu:32 mem:480G    cpu:32 mem:480G    cpu:32 mem:480G
r3.8xl      32K   256K   24M   cpu:16 mem:120G    cpu:16 mem:120G    x                  x
m4.10xl     32K   256K   30M   cpu:20 mem:80G     cpu:20 mem:80G     x                  x
i2.8xl      32K   256K   24M   cpu:16 mem:120G    cpu:16 mem:120G    x                  x

Memory Latency Tests

The lat_mem_rd test from the lmbench suite is used to measure memory latency. lat_mem_rd takes the array size and stride size as arguments. Use a large array size (2-4 GB) and a stride that matches the cpu cache line size (Intel cache line size: 64 bytes). In order to exercise all levels of the memory hierarchy (L1-L3, RAM), pick an array size large enough not to fit into the L1-L3 caches. The test focuses on memory read (load) latency; no memory write (store) operations are issued during the test. Since no data is modified, there is no write-back penalty (invalidating a cache line and flushing it to memory). Latency is reported only for data accesses, not instruction accesses. The test runs two nested loops that traverse the array at stride-size increments. For each array size, the benchmark creates a ring of pointers that point backward one stride. Traversing the array is done by: p = (char **)*p
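For intuition, here is a stripped-down pointer-chasing loop in the same spirit (an illustrative sketch, not lmbench's actual code): each element stores the address of the element one stride back, so every load depends on the previous one, and the measured time per load approximates the latency of whichever cache or memory level the working set lands in.

/* pointer_chase.c - illustrative latency sketch in the spirit of lat_mem_rd.
 * Builds a backward ring of pointers with a 64-byte stride and times dependent loads.
 * Build: gcc -O2 pointer_chase.c -o pointer_chase */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_BYTES (256UL * 1024 * 1024)   /* 256 MB: larger than L1-L3 */
#define STRIDE      64                      /* cache line size */
#define ITERS       (100UL * 1000 * 1000)

int main(void)
{
    size_t nptrs = ARRAY_BYTES / STRIDE;
    char **ring = malloc(nptrs * STRIDE);
    if (!ring)
        return 1;

    /* Each slot points to the slot one stride behind it (a backward ring). */
    for (size_t i = 0; i < nptrs; i++) {
        size_t prev = (i + nptrs - 1) % nptrs;
        *(char **)((char *)ring + i * STRIDE) = (char *)ring + prev * STRIDE;
    }

    char **p = ring;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < ITERS; i++)
        p = (char **)*p;                    /* dependent load, as in lat_mem_rd */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print p so the compiler cannot optimize the chase away. */
    printf("%.1f ns per load (last p=%p)\n", ns / ITERS, (void *)p);
    return 0;
}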
Latency is reported in nanoseconds (ns) over a range of memory sizes. Sizes that fit into the L1-L3 caches reflect cache latency, while larger sizes reflect physical memory latency. Results are reported in two columns: the first is the array size in MB and the second is the load latency (ns) over all the points of the array. Graphing the results shows the relative latencies of the entire memory hierarchy.
Since we are interested in memory latency, we will ignore latencies reported for array sizes below 50 MB, since a working set below 50 MB may fit into the cpu caches (L1-L3).
To install, type:
$ sudo apt-get update ; sudo apt-get -y install lmbench
Binaries will be installed in Dir: /usr/lib/lmbench/bin/x86_64-linux-gnu
Memory Latency with no external load
Instance Type                Local Latency   Remote Latency                              Interleave (all nodes)
x1.32xl (4 numa nodes)       14 ns           node1: 21 ns, node2: 23 ns, node3: 21 ns    20 ns
r3.8xlarge (2 numa nodes)    8 ns            11 ns                                       10 ns
m4.10xl (2 numa nodes)       10 ns           12 ns                                       11 ns
i2.8xlarge (2 numa nodes)    8 ns            11 ns                                       10 ns
Local Latency Test
Thread is pinned to a single cpu. Using Linux default memory allocation policy.
$sudo taskset -c 8 ./lat_mem_rd 4000 64
Thread is pinned to cpu 8 and memory is allocated from the same node (node 0). Linux Bind memory allocation policy.
$numactl --physcpubind=8 --membind=0 ./lat_mem_rd 2000 64
Remote Latency Test
Thread is pinned to cpu 8 and memory is allocated from remote nodes only (node1, node2, node3)
$numactl --physcpubind=8 --membind=1 ./lat_mem_rd 2000 64
$numactl --physcpubind=8 --membind=2 ./lat_mem_rd 2000 64
$numactl --physcpubind=8 --membind=3 ./lat_mem_rd 2000 64
Interleave Latency Test
Thread is pinned to cpu 8 and memory is allocated across all nodes.

$numactl --physcpubind=8 --interleave=all ./lat_mem_rd 2000 64
Numa Latency with external load
To induce external load, the memory latency test was run with multiple STREAM tests running in the background. For the local latency test, 7 STREAM tests generated load on the same numa node. For the remote latency test, 7 STREAM tests generated load across the QPI link. The cpu running the latency test was excluded from running STREAM tests.
Instance Type                Local Latency   Remote Latency   Interleave (all nodes)
x1.32xl (4 numa nodes)       40 ns           107 ns           43 ns
m4.10xl (2 numa nodes)       30 ns           50 ns            42 ns
r3.8xl (2 numa nodes)        TBD             TBD              TBD
i2.8xl (2 numa nodes)        31 ns           36 ns            33 ns

Memory Bandwidth Tests

The STREAM benchmark is used to measure NUMA server memory bandwidth. NUMA servers are used for throughput computing, where memory bandwidth plays a critical role in achieving higher throughput. As the gap between processor and memory speeds widens and memory density increases, application performance is more likely to be limited by the memory bandwidth of the system than by its compute (cpu) capabilities. The STREAM benchmark is designed to simulate the performance of a very large vector-style application that works with datasets much larger than the available cpu cache (L1, L2, L3). A general rule of thumb is that each array should be at least 4x the size of the sum of all cpu caches. The amount of memory consumed by a single array is STREAM_ARRAY_SIZE multiplied by 8 bytes, the size of a single array element. Thus, to find the number of elements for a target array size, divide the total memory by 8 bytes. For example, a 1 GB array requires 1024 * 1024 * 1024 / 8 = 134217728 elements; for a 4 GB array, multiply this by 4 to get 536870912 elements. The total amount of memory consumed by the STREAM benchmark is actually 3 times this value, since the benchmark allocates 3 arrays. The STREAM code is written in such a way that data re-use (cache hits) is not possible. STREAM measures sustainable data transfer rates for uncached, unit-stride vector operations and reports memory bandwidth in MBps.
Set of memory operations performed during the STREAM benchmark:
Operation   Detail
COPY        Measures memory transfer rate only; no arithmetic is performed. 16 bytes are counted per loop iteration.
SCALE       Adds a simple arithmetic operation to COPY.
SUM         Adds multiple memory loads/stores (reads/writes) on the vectors being tested.
TRIAD       Each iteration performs two reads from memory (for "b(i)" and "q*c(i)") and one write (writing the result a(i) to memory), for a total of 24 bytes transferred. Memory bandwidth is then calculated from how many iterations can be completed in a given amount of time; 1 million iterations in 1 second would give a score of 24 MB/s.
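For reference, a stripped-down TRIAD-style kernel (an illustrative sketch, not the official STREAM code, which also includes timing of all four kernels, multiple repetitions and result validation):

/* triad_sketch.c - illustrative TRIAD-style kernel: a(i) = b(i) + q*c(i).
 * Each iteration moves 24 bytes (two 8-byte reads, one 8-byte write).
 * Build: gcc -O2 triad_sketch.c -o triad_sketch */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64UL * 1024 * 1024)      /* 64M doubles per array = 512 MB each */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c)
        return 1;
    for (size_t i = 0; i < N; i++) { b[i] = 2.0; c[i] = 0.5; }

    const double q = 3.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];     /* the TRIAD operation */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double mbps = (24.0 * N) / secs / 1e6;   /* 24 bytes moved per iteration */
    printf("TRIAD: %.1f MB/s (a[0]=%f)\n", mbps, a[0]);
    free(a); free(b); free(c);
    return 0;
}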
See STREAM Top 20 Results on various platforms. Memory Throughput Top 20 Results are also available.
Download the STREAM source and compile it:
$ gcc -O2 -c mysecond.c
$gcc -O2 -mcmodel=medium -DSTREAM_ARRAY_SIZE=536870912 stream.c -o stream.4G # Sets Array Size to 4GB

Amazon Numa Instances Memory Bandwidth
Instance Type        Local Bandwidth   Remote Bandwidth
x1.32xl (4 nodes)    46 GBps           13 GBps
i2.8xl (2 nodes)     28 GBps           24 GBps
m4.10xl (2 nodes)    29 GBps           16 GBps

Local Memory Bandwidth
Test: $numactl --physcpubind=8 --membind=0 ./stream.4G
x1.32xl
Maximum memory bandwidth per socket is 102 GBps. A single cpu STREAM test was able to achieve 8-9 GBps (8-9% of theoretical bandwidth). Running 16 STREAM tests concurrently across 16 cores (1 thread/core) on the same physical cpu pushed bandwidth to 45-46 GBps (45% of theoretical bandwidth). Increasing the number of STREAM tests to 32 (2 threads/core) on the physical cpu did not improve throughput, which peaked at 46 GBps.
i2.8xl
A single cpu STREAM test was able to achieve 11-12 GBps. Running 8 STREAM tests concurrently across 8 cores (1 thread/core) on the same physical cpu pushed bandwidth to 27-28 GBps. Increasing the number of STREAM tests to 16 (2 threads/core) on the same physical cpu did not improve throughput, which peaked at 28 GBps.
m4.10xl
A single cpu STREAM test was able to achieve 11-12 GBps. Running 10 STREAM tests concurrently across 10 cores (1 thread/core) on the same physical cpu pushed bandwidth to 28-29 GBps. Increasing the number of STREAM tests to 20 (2 threads/core) on the same physical cpu did not improve throughput, which peaked at 29 GBps.

Remote Memory Bandwidth
Test: $numactl --physcpubind=8 --membind=1 ./stream.4G
x1.32xl
Maximum QPI bandwidth per socket is 76 GBps. A single cpu STREAM test was able to achieve 6-7 GBps (8-9% of theoretical QPI bandwidth). Running 16 STREAM tests concurrently across 16 cores (1 thread/core) on the same physical cpu pushed bandwidth across the QPI link to 13 GBps (17% of theoretical QPI bandwidth). Increasing the concurrency to 32 STREAM tests (2 threads/core) on the same physical cpu did not improve remote memory throughput, which peaked at 13 GBps.
i2.8xl
A single cpu STREAM test was able to achieve 8-9 GBps. Running 8 STREAM tests concurrently across 8 cores (1 thread/core) on the same physical cpu pushed bandwidth to 24 GBps. Increasing the concurrency to 16 STREAM tests (2 threads/core) on the same physical cpu did not improve remote memory throughput, which peaked at 24 GBps.
m4.10xl
A single cpu STREAM test was able to achieve 8-9 GBps. Running 10 STREAM tests concurrently (1 thread/core) on the same physical cpu pushed bandwidth across the QPI link to 16 GBps. Increasing the concurrency to 20 STREAM tests (2 threads/core) on the same physical cpu did not improve remote memory throughput, which peaked at 16 GBps.
Interleave Memory Bandwidth
Test: numactl --physcpubind=8 --interleave=all ./stream.4G
Memory bandwidth measured with interleaving sits in the middle (balanced) between the local and remote memory bandwidth.
Some Considerations about STREAM Memory Bandwidth Tests
  • The STREAM benchmark counts only the bytes that the user program requested to be loaded or stored, to keep results comparable. If the platform transfers additional data to/from memory beyond what the user requested, that additional data is not counted in the memory bandwidth calculated by the benchmark. Thus the actual memory bandwidth of the platform may be higher than what STREAM reports.
  • The cache coherency protocol (required for SMP servers) will not allow a cpu to write a cache line to physical memory without first reading it. The read is done before the write to ensure no one else has a copy of that particular cache line and that the processor writing it has ownership of the line. When running the STREAM benchmark, if the processor must first read the cache line before writing it, you are effectively doing three reads and one write to memory; however, only two reads and one write are counted towards the STREAM bandwidth score. Where three reads and one write must be done to abide by cache coherency, about 25% of the available memory bandwidth is not counted by the benchmark: the extra read uses memory bandwidth but is not reported. Thus the reported STREAM bandwidth score could be ~25% below what the platform is actually transferring to/from memory.
  • Memory bandwidth measurements may differ due to several factors: DIMM speeds and interleaving, bus speeds, memcopy optimizations, number of memory channels, QPI link throughput (NUMA), L1-L3 caches, local vs. remote memory, hardware and software prefetching, DMA operations by IO devices during memory access, etc.
Memory Interleaving
 
Memory interleaving is a hardware feature that increases cpu-to-memory bandwidth by allowing parallel access to multiple memory banks. It refers to how physical memory is spread across the physical DIMMs, and it is highly effective for reading large volumes of data in succession. Enterprise servers support interleaving because it boosts memory access performance via parallel access to multiple on-board memory units. Interleaving works by dividing system memory into multiple blocks of two or four, called 2-way or 4-way interleaving. Each block of memory is accessed by a different set of control lines that are merged together on the memory bus, so reads and writes to the blocks can be overlapped. Consecutive memory addresses are spread over the different blocks to exploit interleaving. A balanced system provides the best interleaving: an Intel Xeon server is considered balanced when all memory channels on a socket have the same amount of memory. The goal should be to populate memory evenly across all sockets to improve throughput and reduce latency.

Note: memory interleaving may not be visible on a cloud instance or on a system running under hypervisor control.

Dual or Quad Rank Memory Modules

The term "Rank" means 64-bit chunk of data. Number of ranks means number of independent DRAM sets in a DIMM that can be accessed for the 64 data bit, the width of the DIMMMemory DIMMS with a single 64-bit chunk of data is called single-rank module. Nowadays, quad-rank modules are common considering higher rank allows denser DIMMs.  
Clocking memory at a higher frequency also help improve memory throughput. For example, performance gain of using 1066MHz memory versus 800MHz memory is 28% and 1333MHz vs 1066MHz is 9%. DDR4 Memory specification supports higher frequency than DDR3. DDR4 can support transaction rates of 2133 - 4266 MTps (Million Transaction per second) as compared to DDR3 that is limited to 800 - 2133 MTps. DDR4 also uses less power than DDR3. 
