Monday, July 18, 2016

How Linux Kernel Manages Application Memory

Linux uses Virtual Memory (VM) as a logical layer between application memory requests and physical memory (RAM). The VM abstraction hides the complexity of the platform-specific physical memory implementation from the application. When an application accesses a virtual address that does not yet have physical memory mapped to it, the hardware MMU raises an exception called a Page Fault. The Linux kernel services the fault by mapping the faulting virtual address to a physical memory page.
Virtual to Physical Page Translation

A page is a fixed-size block of contiguous addresses; physical memory is divided into page-sized frames. The default page size is 4 KB on the x86 platform. Virtual addresses are transparently mapped to physical memory through the collaboration of hardware (the MMU, Memory Management Unit) and software (page tables). Virtual-to-physical mapping information is also cached in hardware, in the TLB (Translation Lookaside Buffer), so that later lookups of physical memory locations are fast.
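To make the page-granular translation concrete, here is a minimal Python sketch that splits a virtual address into a virtual page number and a page offset, then looks the page up in a toy page table. The dictionary `page_table` and its frame numbers are invented for illustration; a real MMU walks multi-level tables in hardware.

```python
PAGE_SIZE = 4096            # 4 KB pages, as on x86
OFFSET_BITS = 12            # log2(4096)

# Hypothetical toy page table: virtual page number -> physical frame number.
page_table = {0x00400: 0x1A2B3, 0x00401: 0x00042}


def translate(vaddr):
    """Return the physical address for vaddr, or raise to mimic a page fault."""
    vpn = vaddr >> OFFSET_BITS            # which virtual page
    offset = vaddr & (PAGE_SIZE - 1)      # byte position inside that page
    if vpn not in page_table:
        raise LookupError(f"page fault at {vaddr:#x}")
    return (page_table[vpn] << OFFSET_BITS) | offset


print(hex(translate(0x00400ABC)))         # the offset 0xabc is carried over unchanged
```

Only the page number is translated; the low 12 bits pass through untouched, which is why a single page is the only unit guaranteed to be physically contiguous.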

Figure: Virtual to physical memory mapping
VM abstraction offers several benefits:
  • Programmers do not need to know the physical memory architecture of the platform. VM hides it and allows writing architecture-independent code.
  • A process always sees a linear, contiguous range of bytes in its address space, regardless of how fragmented physical memory is.
    • For example, when an application allocates 10 MB of memory, the Linux kernel reserves a contiguous 10 MB virtual address range in the process address space. The physical memory locations to which this range is mapped need not be contiguous. The only unit guaranteed to be physically contiguous is a single page (4 KB).
  • Faster startup due to partial loading. Demand paging loads instructions as they are referenced.
  • Memory sharing. A single copy of a library or program in physical memory can be mapped into multiple process address spaces, which makes efficient use of physical memory. "pmap -X <pid>" shows which parts of a process's resident memory are shared with other processes and which are private.
  • Several programs whose memory footprints exceed physical memory can run concurrently. Behind the scenes, the kernel transparently relocates the least recently accessed pages to disk (swap).
  • Each process is isolated in its own virtual address space and thus cannot affect or corrupt another process's memory.
Two processes may use the same virtual addresses, but these are mapped to different physical memory locations. Processes that attach to the same shared memory (SHM) segment have their virtual addresses mapped to the same physical memory location.
A process address space can be 32-bit or 64-bit. A 32-bit address space is limited to 4 GB, compared with hundreds of terabytes for a 64-bit address space (256 TB in total with the 48-bit virtual addresses commonly implemented on x86-64). The size of the process address space limits the amount of physical memory an application can use.
The process address space is the range of virtual memory addresses exported to a process as its environment. It is composed of memory segments of the following types: Text, Data, Heap, Stack, Shared memory (SHM) and mmap. The process address map can be viewed using "pmap -X <pid>".
Figure: Memory segments that make up the process address space
Each memory segment is a linear virtual address range with a starting and an ending address, and is backed by a backing store such as a file system or swap. A page fault is serviced by filling a physical memory page from the backing store. During memory shortages, data cached in physical memory pages is migrated back to its backing store. The process "Text" segment is backed by the executable file on the file system. Stack, heap, COW (Copy-on-Write) and shared memory pages are called anonymous (Anon) pages and are backed by swap (a disk partition or file). When swap is not configured, anonymous pages cannot be reclaimed: with no place to migrate their data during memory shortages, they are effectively locked into memory.
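The distinction between file-backed and anonymous segments is visible in /proc/<pid>/maps, where each line ends with the backing path if there is one. The sketch below classifies such lines. The sample lines are made up, and the rule is deliberately simplified: it treats any mapping with a path as file-backed, so shared memory segments would need extra handling.

```python
def classify_mapping(line):
    """Classify one /proc/<pid>/maps line as 'file', 'anon', or 'special'."""
    fields = line.split(None, 5)          # address perms offset dev inode [path]
    path = fields[5].strip() if len(fields) > 5 else ""
    if path.startswith("["):
        # [heap] and [stack] are anonymous; [vdso] and friends come from the kernel
        return "anon" if path in ("[heap]", "[stack]") else "special"
    return "file" if path.startswith("/") else "anon"


# Illustrative lines only, not captured from a real process.
SAMPLE = """\
00400000-0040c000 r-xp 00000000 08:01 1311 /usr/bin/cat
0060c000-0060d000 rw-p 0000c000 08:01 1311 /usr/bin/cat
01a3f000-01a60000 rw-p 00000000 00:00 0 [heap]
7f2c4a000000-7f2c4a021000 rw-p 00000000 00:00 0
7ffd3b5e0000-7ffd3b601000 rw-p 00000000 00:00 0 [stack]
7ffd3b7f3000-7ffd3b7f5000 r-xp 00000000 00:00 0 [vdso]
"""

for entry in SAMPLE.splitlines():
    print(classify_mapping(entry), entry.split()[0])
```

To inspect a live process, the same function can be applied to each line of /proc/<pid>/maps.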
When a process calls malloc() or sbrk(), the kernel creates or extends a heap segment in the process address space and reserves a range of virtual addresses that the process can access legally. Any reference to a virtual address outside the reserved ranges results in a segmentation violation that kills the process. Physical memory allocation is delayed until the process accesses virtual addresses within the newly created segment. That means an application that performs a 50 GB malloc and touches (page faults) only a 10 MB range of virtual addresses consumes only 10 MB of physical memory. Per-process virtual and physical memory usage can be viewed using "ps", "pidstat" or "top": the virtual size is reported as VSZ (VIRT in top) and the resident physical memory as RSS (RES in top). "pmap -X <pid>" gives a detailed view of the types of memory a process has allocated.
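The delayed allocation described above can be observed from user space. The following sketch, which assumes a Linux system with /proc mounted, reserves an anonymous region with mmap, touches only part of it, and compares VmSize and VmRSS from /proc/self/status at each step. The function names and default sizes are chosen for illustration.

```python
import mmap
import os


def vm_fields(status_text):
    """Extract VmSize and VmRSS (in kB) from /proc/<pid>/status text."""
    out = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            out[key] = int(rest.split()[0])
    return out


def demo(reserve=256 * 1024 * 1024, touch=16 * 1024 * 1024):
    """Reserve `reserve` bytes, touch only `touch` bytes, report the growth."""
    status = "/proc/self/status"
    if not os.path.exists(status):
        return None                        # not Linux, nothing to measure

    def snapshot():
        with open(status) as f:
            return vm_fields(f.read())

    before = snapshot()
    # Anonymous private mapping: virtual addresses are reserved, no RAM yet.
    region = mmap.mmap(-1, reserve, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    reserved = snapshot()
    for off in range(0, touch, mmap.PAGESIZE):
        region[off] = 1                    # first write to a page faults it in
    touched = snapshot()
    region.close()
    return {
        "VmSize growth after mmap (kB)": reserved["VmSize"] - before["VmSize"],
        "VmRSS growth after mmap (kB)": reserved["VmRSS"] - before["VmRSS"],
        "VmRSS growth after touching (kB)": touched["VmRSS"] - reserved["VmRSS"],
    }


if __name__ == "__main__":
    print(demo())
```

On a typical Linux machine, VmSize grows by roughly the reserved amount as soon as the mapping exists, while VmRSS grows only once pages are touched. Exact figures vary with the kernel and its settings.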
Physical memory pages used for program Text and for caching file system data (the page cache) can be freed quickly during memory shortages because the data can always be retrieved from the backing store (the file system). To free anonymous pages, however, the data must first be written to the swap device.
Figure: Anonymous memory segments (heap, stack, COW, shared memory) are backed by swap (disk)

Linux Memory Allocation Policy

Process memory allocation is controlled by the Linux memory allocation policy. Linux offers three allocation modes, selected by the tunable vm.overcommit_memory:
  • Heuristic overcommit (vm.overcommit_memory=0): The default mode. Processes may overcommit a "reasonable" amount of memory, as determined by internal heuristics that take free memory and free swap into account. Memory that can be freed by shrinking the file system cache and kernel slab caches (used by kernel drivers and subsystems) is also taken into consideration.
    • Pros: Uses relaxed accounting rules, which is useful for programs that typically request more memory than they actually use. As long as there is sufficient free memory and/or swap to meet the request, the process continues to function.
    • Cons: The Linux kernel makes no attempt to reserve physical memory on behalf of a process; physical memory is assigned only as the process touches (accesses) virtual addresses in the memory segment.
      • Example: Say an application, myapp, allocates 50 GB of memory but touches only 10 GB. The 40 GB of physical memory not yet touched by myapp remains available to other applications. If another application or a malicious program touches all available free memory before myapp gets to touch it, this could trigger the OOM (Out Of Memory) Killer, which may terminate myapp in a desperate attempt to free memory.
  • Always overcommit (vm.overcommit_memory=1): A process may overcommit as much memory as it wants, and the allocation always succeeds.
    • Pros: Wild allocations are permitted because there are no checks against free memory or swap.
    • Cons: Same as heuristic overcommit. An application can malloc() terabytes on a system with a few gigabytes of physical memory. No failure occurs until enough pages are touched to exhaust physical memory, which triggers the OOM Killer.
  • Strict overcommit (vm.overcommit_memory=2): Prevents overcommit by charging every allocation against a fixed commit limit at allocation time, in addition to reserving the virtual address range. With no overcommit, application allocations should not trigger the OOM Killer. The kernel keeps track of the amount of memory already committed. "cat /proc/meminfo" reports the metrics CommitLimit and Committed_AS to help estimate the memory available for allocation. Because strict overcommit mode does not take free memory and swap usage into consideration, do not use the free memory or swap metrics (reported by free or vmstat) to discover available memory. To calculate the remaining allocation headroom, use "CommitLimit - Committed_AS". The kernel tunable vm.overcommit_ratio sets the commit limit for this mode: CommitLimit = swap + physical memory x (overcommit_ratio / 100). The limit can be raised by setting vm.overcommit_ratio to a bigger value (the default is 50, that is, 50% of physical memory).
    • Pros: Effectively disables the OOM Killer. A failure at startup has lower production impact than being killed by the OOM Killer while serving production load. Solaris offers only this mode. Strict overcommit does not use free memory or free swap in its limit calculations.
    • Cons: No overcommit is allowed. Memory allocated but not used by one application cannot be used by another. A new program may fail to allocate memory even when the system reports plenty of free memory, because memory is reserved on behalf of existing processes. Monitoring free memory becomes tricky. Some badly written applications do not handle memory allocation failures; failing to check for them may result in corrupted memory and random, hard-to-debug failures.
      • Note: Memory not used by the application can still be used for the file system cache, because page cache memory can be freed when the application needs it.
NOTE: For both heuristic and strict overcommit, the kernel reserves a certain amount of memory for root. In heuristic mode, this is 1/32nd of free physical memory. In strict overcommit mode, it is 1/32nd of the percentage of real memory that you set. This is hard-coded in the kernel and cannot be tuned. On a system with 64 GB, that means 2 GB is reserved for the root user.
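The strict overcommit arithmetic can be sketched in a few lines of Python. The function names and the 64 GB / 8 GB host are assumptions made for illustration, and the formula is a simplification: the kernel also excludes memory set aside for huge pages from the calculation.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50):
    """Approximate CommitLimit for vm.overcommit_memory=2.

    Simplification: memory reserved for huge pages, which the kernel
    subtracts from RAM first, is ignored here.
    """
    return swap_kb + ram_kb * overcommit_ratio // 100


def headroom_kb(commit_limit, committed_as):
    """Memory that can still be allocated before requests start failing."""
    return commit_limit - committed_as


GIB_KB = 1024 * 1024
ram, swap = 64 * GIB_KB, 8 * GIB_KB        # hypothetical 64 GB host with 8 GB swap

for ratio in (50, 80):
    limit = commit_limit_kb(ram, swap, ratio)
    print(f"ratio={ratio}: CommitLimit = {limit / GIB_KB:.1f} GiB")
```

With the default ratio of 50, this host could commit 40 GiB in total. Raising the ratio to 80 lifts the limit to about 59 GiB, which is why the ratio is usually increased on machines with little or no swap.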

What causes OOM Killer


The OOM Killer is invoked when a system-level memory shortage reaches an extreme: the file system cache has been shrunk and all reclaimable pages have been reclaimed, yet memory demand stays high and ultimately exhausts all available memory. To deal with this situation, the kernel selects processes that can be killed to free memory. This desperate kernel action is called the OOM Killer.
The criteria used to pick a victim sometimes select the most critical process. There are a few options for dealing with the OOM Killer:
  • Disable the OOM Killer by changing the kernel memory allocation policy to strict overcommit.
    • $ sudo sysctl vm.overcommit_memory=2
    • $ sudo sysctl vm.overcommit_ratio=80
  • Opt the critical process out of OOM Killer consideration.
    • $ echo -17 > /proc/<pid-critical-process>/oom_adj
    • oom_adj is deprecated on newer kernels; the equivalent there is: $ echo -1000 > /proc/<pid-critical-process>/oom_score_adj
  • Opting out a critical server process may sometimes not be enough to keep the system functioning, because the kernel still has to kill processes to free memory. In some cases, automatically rebooting the server when the OOM Killer is triggered may be the better option.
    • $ sudo sysctl vm.panic_on_oom=1
    • $ sudo sysctl kernel.panic=<number_of_seconds_to_wait_before_reboot>
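Before exempting a process, it can help to see how the kernel currently ranks it. The sketch below reads /proc/<pid>/oom_score and /proc/<pid>/oom_score_adj, the newer interface that replaces oom_adj on current kernels. The helper names are hypothetical, and lowering the adjustment normally requires root privileges.

```python
import os


def read_oom_info(pid="self"):
    """Return the kernel's current OOM score and adjustment for a process."""
    base = f"/proc/{pid}"
    info = {}
    for name in ("oom_score", "oom_score_adj"):
        path = os.path.join(base, name)
        if not os.path.exists(path):
            return None                    # not Linux, or the process is gone
        with open(path) as f:
            info[name] = int(f.read().strip())
    return info


def exempt_from_oom(pid):
    """Ask the kernel never to pick `pid`; lowering the value needs root."""
    with open(f"/proc/{pid}/oom_score_adj", "w") as f:
        f.write("-1000")                   # -1000 excludes the process entirely


if __name__ == "__main__":
    print(read_oom_info())                 # inspect the current process
```

A higher oom_score makes a process a more likely victim, so checking the scores of critical services under load shows whether an exemption is needed at all.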

File System Cache Benefits

Linux uses memory that is not being used by applications for caching file system pages and disk blocks. Memory used by the file system cache is effectively free: it is available when needed (after modified pages are written to the backing store). Recent versions of the Linux "free" command report it under buff/cache and include the reclaimable part in the "available" column. The benefit of the file system cache is improved performance of application file system reads and writes:
  • Read: When an application reads from a file, the kernel performs a physical I/O to read the data blocks from disk. The data is cached in the file system cache to avoid later physical reads. When the application requests the same block again, only a logical I/O (a read from the file system page cache) is required, which improves application performance. In addition, file systems prefetch (read ahead) blocks when a sequential I/O pattern is detected, in anticipation that the application will request the adjacent blocks next. This also helps reduce I/O latencies.
  • Write: When an application writes to a file, the kernel caches the data in the page cache and acknowledges completion (a buffered write). File data sitting in the file system cache can also be updated multiple times in memory (write cancelling) before the kernel schedules the dirty pages to be written to disk.
Figure: File system cache improves both read and write performance

Dirty pages in the file system cache are written to disk by the "flusher" kernel threads (formerly pdflush). Dirty pages are flushed periodically and also when the proportion of dirty pages in memory exceeds a threshold (kernel tunables such as vm.dirty_background_ratio). The file system cache improves application I/O performance by hiding storage latencies.
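The size of the page cache and the amount of dirty data waiting for the flusher are both exposed in /proc/meminfo. Here is a small Python sketch that parses that format. The sample text uses invented numbers, and the last printed line is only a rough estimate of how much memory could be reclaimed.

```python
def parse_meminfo(text):
    """Turn /proc/meminfo text into a {field: value} dictionary (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


# Illustrative numbers only, not taken from a real system.
SAMPLE = """\
MemTotal:       16384000 kB
MemFree:          512000 kB
MemAvailable:    9216000 kB
Cached:          8192000 kB
Dirty:             20480 kB
Writeback:             0 kB
"""

info = parse_meminfo(SAMPLE)
print("page cache (kB):        ", info["Cached"])
print("waiting for flush (kB): ", info["Dirty"])
print("reclaimable, rough (kB):", info["MemAvailable"] - info["MemFree"])
```

On a live system, pass the contents of /proc/meminfo instead of SAMPLE. Watching Dirty during a heavy write workload shows the cache absorbing writes before the flusher drains them.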

HugeTLB or HugePages Benefits

Figure: A TLB miss results in a walk of the memory-resident page tables

As discussed earlier, the TLB (Translation Lookaside Buffer), integrated on the CPU chip, caches virtual-to-physical translations. When a translation is not found in the TLB (a TLB miss), the result is an expensive walk of the memory-resident page tables to find it. TLB hits are becoming more important because of the growing disparity between CPU and memory speeds and increasing memory density. Frequent TLB misses may hurt application performance. The TLB is a scarce resource on the CPU chip, and the Linux kernel tries to make the best use of the limited TLB entries. Each TLB entry can be programmed to map a contiguous physical memory range of one of several sizes: 4 KB, 2 MB or 1 GB. The Linux HugeTLB feature allows applications to use large pages (2 MB or 1 GB) instead of the default 4 KB size.

An Intel Haswell core has, in its L1 DTLB, 64 entries for 4 KB page translations, 32 entries for 2 MB pages and 4 entries for 1 GB pages. There is also a unified (shared) L2 TLB that can hold 1024 translations for 4 KB or 2 MB pages. Once the virtual address has been calculated, the processor probes the TLB for the virtual-to-physical translation and then fetches the data in 64-byte chunks (cache lines) from the physical memory location into the L1/L2 hardware caches.
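From these entry counts, one can estimate how much address space each page size can cover without a TLB miss (often called TLB reach). The sketch below is a back-of-the-envelope calculation that assumes the whole L2 TLB is available to a single page size, which is an idealization.

```python
KB, MB, GB = 1024, 1024**2, 1024**3

# Entry counts quoted above for Haswell; the shared L2 TLB has no 1 GB entries.
L1_ENTRIES = {4 * KB: 64, 2 * MB: 32, 1 * GB: 4}
L2_ENTRIES = {4 * KB: 1024, 2 * MB: 1024, 1 * GB: 0}


def tlb_reach(page_size):
    """Bytes of address space reachable without a TLB miss for one page size."""
    return (L1_ENTRIES[page_size] + L2_ENTRIES[page_size]) * page_size


for size in (4 * KB, 2 * MB, 1 * GB):
    print(f"{size // KB:>8} KB pages -> {tlb_reach(size) / MB:,.2f} MB")
```

The jump from a few megabytes with 4 KB pages to a few gigabytes with 2 MB pages is the main reason huge pages reduce TLB misses for large working sets.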

Pros and Cons of Linux HugeTLB feature:

Pros:

  • HugeTLB may help reduce TLB misses by covering a bigger portion of the process address space. For an Intel Haswell processor (L1 DTLB plus L2 TLB entries):
    • 4 KB pages cover: 64 x 4 KB + 1024 x 4 KB = 4.25 MB (about 4 MB)
    • 2 MB pages cover: 32 x 2 MB + 1024 x 2 MB = 2112 MB (about 2 GB)
    • 1 GB pages cover: 4 x 1 GB = 4 GB
  • A TLB miss with HugeTLB is cheaper to service. Translating a 4 KB page via the page tables requires multiple levels of lookup (4 levels for the standard 48-bit virtual address space). Larger pages need fewer page table entries and a shallower walk: 3 levels for 2 MB pages and 2 levels for 1 GB pages. This reduces both the memory latency of a page table walk and the physical memory consumed by page tables.
  • Reduces page fault rates. Each page fault fills 2 MB or 1 GB of physical memory rather than 4 KB, so the application warms up much faster.
  • The performance improvement from HugeTLB depends on the application's access pattern. If the access pattern shows data locality, HugeTLB will help. However, if the application reads from random locations or reads only a few bytes from each page (for example, lookups in a large hash table) and the working set is too big to fit in the TLB, then the 4 KB page size may offer better performance.
  • 1 GB pages may offer the best performance when the working set fits in 4 GB of physical memory. Even when the working set is bigger, a page table walk with 1 GB pages is much quicker.
  • Huge pages are locked in memory and thus are not candidates for page-out during memory shortages.
  • Large pages also improve memory prefetching by eliminating the need to restart the prefetch operation at 4 KB boundaries.
  • Transparent HugePages benchmark results show remarkable improvement.
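The shallower page table walk for larger pages follows from how a 48-bit x86-64 virtual address is split: four 9-bit table indices plus a 12-bit offset for 4 KB pages, with each larger page size absorbing one more index into the offset. The sketch below illustrates that split. The level names follow common x86-64 terminology and the sample address is arbitrary.

```python
LEVELS = ("PML4", "PDPT", "PD", "PT")      # x86-64 4-level paging, 9 bits each


def walk_indices(vaddr, page_size=4096):
    """Return the table indices consulted for vaddr and the final page offset."""
    offset_bits = page_size.bit_length() - 1      # 12, 21 or 30
    levels_used = (48 - offset_bits) // 9         # 4, 3 or 2 table lookups
    indices = {}
    for depth in range(levels_used):
        shift = 39 - 9 * depth
        indices[LEVELS[depth]] = (vaddr >> shift) & 0x1FF
    return indices, vaddr & (page_size - 1)


addr = 0x7F2C4A123456                      # arbitrary user-space address
for size in (4096, 2 * 1024**2, 1024**3):
    idx, off = walk_indices(addr, size)
    print(f"{size:>10} B page: {len(idx)} lookups, offset {off:#x}")
```

Each lookup removed from the walk is one fewer dependent memory access on a TLB miss.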
Cons:
  • Huge pages require upfront reservation. The system administrator must set a kernel tunable to the desired number of huge pages: vm.nr_hugepages=<number_of_pages>
    • The Linux Transparent Huge Pages (THP) feature does not have this upfront cost. THP is still new, has limited uses and has known performance bugs. More THP testing is needed!
  • The application must be huge-page aware. For example, a Java application should be started with the "-XX:+UseLargePages" option in order to use large pages for the Java heap. Otherwise, the reserved pages are not used for any purpose. Huge page usage can be monitored using "grep Huge /proc/meminfo" (the PageTables field in the same file shows the memory consumed by page tables).
  • Huge pages require contiguous physical memory of 2 MB or 1 GB. A request for large pages may fail if the system has been running for a long time and most of its memory has become fragmented into 4 KB chunks.
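Because the pool must be sized in advance, a quick calculation helps when choosing vm.nr_hugepages. The sketch below computes how many 2 MB pages cover a region and summarizes pool usage from the HugePages_* fields of /proc/meminfo. The 8 GB heap is a hypothetical example, and a real JVM needs some extra pages beyond the heap itself.

```python
HUGE_PAGE_BYTES = 2 * 1024**2             # default huge page size on x86-64


def hugepages_needed(region_bytes, page_bytes=HUGE_PAGE_BYTES):
    """Smallest number of huge pages that fully covers region_bytes."""
    return (region_bytes + page_bytes - 1) // page_bytes


def huge_pool_summary(meminfo_text):
    """Summarize huge page pool usage from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("HugePages_") or key == "Hugepagesize":
            fields[key] = int(rest.split()[0])
    in_use = fields["HugePages_Total"] - fields["HugePages_Free"]
    return {"in_use_pages": in_use,
            "in_use_kb": in_use * fields["Hugepagesize"]}


heap = 8 * 1024**3                         # hypothetical 8 GB Java heap
print("vm.nr_hugepages >=", hugepages_needed(heap))
```

If HugePages_Free stays equal to HugePages_Total after the application starts, the application is not using the pool and the reserved memory is being wasted.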
