Deep Dive: MMU Virtualization with Xen on ARM
Introduction
This article explores how the Xen hypervisor uses the Memory Management Unit (MMU) on the ARM architecture to support virtualization. We’ll take a brief look at the history and purpose of the MMU, then dive into Xen specifics, focusing on the ARMv8 implementation.
A brief MMU history lesson
Back when my father (that’s him above on the left) worked on computers, there was just physical memory, there wasn’t much of it, and it was very expensive (in 1960, IBM 1401 core memory cost $5,242,880 per megabyte). Programs had to either fit into main memory or use complex logic like overlays to manage swapping data between primary memory and secondary storage. It’s estimated that programmers spent up to two-thirds of their time designing, implementing, and debugging overlay strategies.
A generalized solution was required for the overlay problem. The German physicist Fritz-Rudolf Güntsch introduced the concept of virtual memory in his doctoral thesis, Logical Design of a Digital Computer with Multiple Asynchronous Rotating Drums and Automatic High Speed Memory Operation. The thesis described a machine with hardware that automatically moved blocks of data between primary and secondary drum memory.
By the 1960s most major computer manufacturers had decided virtual memory was the right way to go. Not only did virtual memory eliminate the need for programmers to handle overlays, it provided an address space larger than physical memory, the ability for several programs to reside in memory simultaneously, and a way to logically partition memory among multiple programs, preventing them from interfering with each other or the operating system. Additionally, it allowed memory blocks to have enforceable attributes, such as read-only pages.

As with any feature that starts with the word virtual, there are downsides. Implementing virtual memory requires complex logic and resources. Each time a program accesses memory, the virtual address must be translated into a physical address, and if the block of data being requested does not actually reside in main memory, it must be retrieved from secondary storage.

To help alleviate the performance penalty of translating virtual addresses to physical addresses, dedicated memory management unit (MMU) hardware was introduced. MMUs typically work by dividing the virtual address space into pages of equal size. An in-memory page table contains page table entries (one per page) that provide a mapping between virtual addresses and physical addresses. The MMU directly accesses these data structures and normally caches frequently used entries in the translation lookaside buffer (TLB).

Just as an operating system uses an MMU and virtual memory to keep processes separated for security reasons, a hypervisor (such as Xen) uses the MMU to keep the memory used by virtual machines separated.
Should an MMU always be used when designing a system?
The short answer is no. There are downsides to virtual memory and MMUs that affect system designs:
cost and complexity of MMU hardware and circuitry
software complexity required to manage the MMU hardware
performance impact of doing virtual to physical address translation
performance impact incurred when reprogramming the MMU during context switch
lack of determinism during page fault handling
fragmentation of physical memory
With respect to MMU usage, system designs can be divided into four rough categories:
Multi-tasking thread model (no MMU) – Most RTOS products use the thread model where a single multi-threaded process runs along with the RTOS in a single address space. This has the benefit of being simple, low cost, and fast. It lacks security and can make it harder to debug problems – ever tried to track down random memory corruption?
Multi-tasking process model (MMU) – Feature-rich operating systems (Linux, Windows, etc.) use the process model. The operating system uses the MMU to allow multiple processes to run simultaneously. Each process occupies its own private address space starting at address zero. Each virtual address space can be much larger than the actual physical memory present on the system. This is a much more secure, flexible, and general purpose design.
Multi-tasking thread protected model (limited MMU) – Some RTOS products offer a compromise by making limited use of an MMU while still using the thread model. This approach does not remap any memory, but it protects memory that does not belong to the current thread, keeping overhead low while providing some security. This category also includes systems that use a Memory Protection Unit (MPU), a trimmed-down version of an MMU that provides only memory protection support. The MPU allows the definition of memory regions, with access permissions and memory attributes assigned to each region. The main purpose of an MPU is to prevent a process from accessing memory that has not been allocated to it, which prevents a bug or malware within a process from affecting other processes or the operating system.
Other (no MMU) – This includes non-multitasking operating systems, unikernels, systems without a real operating system, etc.
How does an MMU really work?
Let’s dig in a little deeper to understand the details of how an MMU works, using the ARMv8 architecture as a reference.
The MMU is a piece of hardware that can be configured by software running at an appropriate privilege level. Each processor core has an MMU. Each MMU contains:
a Translation Lookaside Buffer (TLB), which stores recently used translations
a Table Walk Unit, which reads entries from the translation tables (also known as page tables) in memory
Prior to enabling the MMU, the page tables must be appropriately set up and the hardware must be told where they can be found in memory. Once the MMU is configured with a set of translation tables, all code running at that privilege level or lower will have its access to memory restricted by the parameters of the MMU. The MMU controls the cache policy, memory attributes, and access permissions. All memory accesses issued by software use virtual addresses, requiring the MMU to translate the virtual address to a physical address for each access.
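To make this concrete, here is a minimal, hypothetical sketch (C with inline assembly) of the kind of sequence bare-metal AArch64 code at EL1 uses to point the MMU at its translation tables and switch it on. The register names (TTBR0_EL1, TCR_EL1, SCTLR_EL1) are real Armv8-A system registers, but the values here are placeholders; a real setup must also program MAIR_EL1 and all of the TCR_EL1 fields correctly.

```c
#include <stdint.h>

/*
 * Hypothetical helper: configure and enable the stage 1 MMU at EL1.
 * table_base is the address of the top-level translation table and
 * tcr_value a pre-computed TCR_EL1 value (granule size, address range, etc.).
 */
static inline void mmu_enable_el1(uint64_t table_base, uint64_t tcr_value)
{
    uint64_t sctlr;

    __asm__ volatile("msr ttbr0_el1, %0" :: "r"(table_base)); /* where the tables live   */
    __asm__ volatile("msr tcr_el1, %0"   :: "r"(tcr_value));  /* how to walk them        */
    __asm__ volatile("isb");                                  /* synchronize the writes  */

    __asm__ volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
    sctlr |= 1u;                                              /* SCTLR_EL1.M = 1: enable MMU */
    __asm__ volatile("msr sctlr_el1, %0" :: "r"(sctlr));
    __asm__ volatile("isb");                                  /* ensure it takes effect  */
}
```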
For each translation, the MMU first checks the TLB for a cached translation. If one isn’t found, the Table Walk Unit walks the translation tables in memory to find the corresponding entry. Once a suitable translation is found, the MMU checks its permissions and attributes, and the memory access either proceeds using the resulting physical address or a fault is signaled. The data residing in the physical page could also have been swapped out to secondary storage; in that case a page fault is signaled, the data is brought back into main memory, and the access is restarted.
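This flow can be modelled roughly as follows. It is a conceptual C sketch of what the hardware does, not real driver code: the software TLB array, the table_walk() placeholder, and the flag names are all illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT  12u
#define TLB_ENTRIES 8

struct tlb_entry { uint64_t vpn, pfn; bool valid, writable; };
static struct tlb_entry tlb[TLB_ENTRIES];          /* software model of the TLB */

/* Placeholder for the hardware table walk; always misses in this sketch. */
static bool table_walk(uint64_t vpn, struct tlb_entry *out)
{
    (void)vpn; (void)out;
    return false;
}

/* Returns true and fills *pa on success; false models a fault being signaled. */
static bool mmu_translate(uint64_t va, bool is_write, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    struct tlb_entry e = { 0 };
    bool hit = false;

    for (int i = 0; i < TLB_ENTRIES; i++)          /* 1. check cached translations    */
        if (tlb[i].valid && tlb[i].vpn == vpn) { e = tlb[i]; hit = true; break; }

    if (!hit && !table_walk(vpn, &e))              /* 2. walk the tables in memory    */
        return false;                              /*    no mapping: translation fault */

    if (is_write && !e.writable)                   /* 3. check permissions/attributes */
        return false;                              /*    permission fault             */

    *pa = (e.pfn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
    return true;                                   /* 4. access proceeds with the PA  */
}
```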
The translation tables work by dividing the virtual address space into equal-size blocks and providing one entry in the table per block. Each entry contains the address of the corresponding block of physical memory and the attributes to use when accessing the physical address.
During the table lookup, the virtual address is split into two parts:
The upper-order bits are used as an index into the page table. The selected entry contains the physical address corresponding to the virtual address.
The lower-order bits are an offset within that block and are not changed by the translation. The physical address obtained from the table lookup is combined with the block offset to form the actual physical address used to access main memory.
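As a quick illustration, here is the arithmetic for a single-level lookup with 4KB blocks; the example address is arbitrary.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SHIFT 12u                                   /* 4KB blocks */

int main(void)
{
    uint64_t va     = 0x40123ABCull;                      /* example virtual address      */
    uint64_t index  = va >> BLOCK_SHIFT;                  /* upper bits: table index      */
    uint64_t offset = va & ((1ull << BLOCK_SHIFT) - 1);   /* lower bits: unchanged offset */

    /* The physical block address comes from table[index]; the offset is
     * appended unchanged to form the final physical address. */
    printf("index = 0x%llx, offset = 0x%llx\n",
           (unsigned long long)index, (unsigned long long)offset);
    return 0;
}
```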
In a single-level lookup, the virtual address space is split into equal-size blocks, normally called pages. In practice, a multi-level hierarchy of tables is used.
Multi-level page tables are built as a tree-like structure of sub-tables to store virtual-to-physical translations. This is an efficient way to keep the page tables as small as possible. The branches in the tree are called page directories; entries in the page directories point to the next level of the tree. The last level, the leaves, holds the entries containing the final physical address translation. The top-level table divides the address space into large blocks; each entry in this table can either point to an equal-sized block of physical memory or point to another table which subdivides the block into smaller blocks. The indices selecting the table entry at each level are calculated from bit slices of the virtual address.
In Armv8-A, the maximum number of levels is four, and the levels are numbered 0 to 3. This multi-level approach allows both larger and smaller blocks to be described. Some of the advantages and disadvantages of each are:
Larger memory blocks require fewer levels of table reads to translate than smaller blocks.
Larger blocks are more efficient to cache in the TLB, i.e. one entry covers a larger memory area
Larger pages can be more efficient to page in from a storage device if seek times are high
Smaller blocks give software finer grained control over memory allocation, less wasted memory
Smaller blocks give software finer grained control over memory attributes
Smaller blocks require more page table entries, which uses more memory for kernel purposes
To manage this trade-off, an OS must balance the efficiency of using large mappings against the flexibility of using smaller mappings for optimum performance.
A translation granule is the smallest block of memory that can be described. Armv8-A supports three different granule sizes: 4KB, 16KB, and 64KB.
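The granule size determines how many low-order address bits form the block offset and, assuming the 8-byte descriptors used by the 64-bit translation scheme, how many entries fit in a single table. A quick sketch of the arithmetic:

```c
#include <stdio.h>

int main(void)
{
    const unsigned granules[] = { 4096, 16384, 65536 };   /* 4KB, 16KB, 64KB */

    for (int i = 0; i < 3; i++) {
        unsigned size = granules[i];
        unsigned offset_bits = 0;

        for (unsigned s = size; s > 1; s >>= 1)
            offset_bits++;                                /* log2(granule size)       */

        printf("%6u-byte granule: %2u offset bits, %4u descriptors per table\n",
               size, offset_bits, size / 8);              /* descriptors are 8 bytes  */
    }
    return 0;
}
```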
Processors that implement the ARMv8-A architecture are typically used in systems running a feature-rich operating system with many applications or tasks that run concurrently. When an application starts, the operating system allocates a set of translation table entries that map both the code and data used by the application to physical memory. Each application has its own unique translation tables residing in physical memory.
Normally, page tables for multiple tasks are present in the memory system. The kernel scheduler periodically transfers execution from one task to another. This is called a context switch. During the context switch, the kernel configures the MMU to use the translation table entries for the next process. The kernel will ensure that the page tables are configured such that applications cannot access physical memory belonging to other applications.
Since virtual address ranges will overlap between tasks, the TLB needs to have a mechanism to prevent returning cached translations for the wrong application. One approach to this problem is to flush the TLB on each context switch, but this approach has serious performance issues. Instead, the ARM architecture uses Address Space Identifiers (ASIDs) to mitigate this problem. An application is assigned an ASID by the OS and all the TLB entries for that application are tagged with the ASID. This allows TLB entries for different applications to coexist in the TLB, without the possibility that one application uses the TLB entries that belong to a different application.
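Conceptually, the ASID simply becomes part of the TLB match. The structures below are a software model with illustrative field names, not the real TLB format.

```c
#include <stdbool.h>
#include <stdint.h>

struct tlb_entry {
    uint16_t asid;     /* which address space the translation belongs to */
    uint64_t vpn;      /* virtual page number                            */
    uint64_t pfn;      /* physical page number                           */
    bool     valid;
};

/*
 * A TLB entry only hits if both the page and the ASID match the currently
 * running task, so entries tagged for other tasks can stay resident across
 * a context switch instead of being flushed.
 */
static bool tlb_hit(const struct tlb_entry *e, uint16_t current_asid, uint64_t vpn)
{
    return e->valid && e->asid == current_asid && e->vpn == vpn;
}
```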
Let’s look at a specific Armv8-A address translation using a 4KB granule size and 48-bit virtual addresses.
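With a 4KB granule and 48-bit virtual addresses, each of the four levels is indexed by 9 bits of the address and the low 12 bits are the page offset. A small sketch of the bit slicing (the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t va = 0x00007F123456789Aull;      /* example 48-bit virtual address */

    unsigned l0  = (va >> 39) & 0x1FF;        /* bits [47:39] -> level 0 index  */
    unsigned l1  = (va >> 30) & 0x1FF;        /* bits [38:30] -> level 1 index  */
    unsigned l2  = (va >> 21) & 0x1FF;        /* bits [29:21] -> level 2 index  */
    unsigned l3  = (va >> 12) & 0x1FF;        /* bits [20:12] -> level 3 index  */
    unsigned off = va & 0xFFF;                /* bits [11:0]  -> offset within the 4KB page */

    printf("L0=%u L1=%u L2=%u L3=%u offset=0x%03X\n", l0, l1, l2, l3, off);
    return 0;
}
```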
How does a hypervisor such as Xen use the MMU?
As a quick reminder, Xen makes it possible to run many instances of an operating system, or different operating systems, in parallel on a single physical machine. As a bare-metal hypervisor, Xen runs directly on the hardware and is responsible for handling CPU, memory, timers, interrupts, etc. It is the first program to run after exiting the bootloader.
On top of the hypervisor run a number of virtual machines. A running instance of a virtual machine is called a domain or guest. Within a guest VM, the operating system must continue to use the MMU to maintain separation between the kernel and the numerous applications running in parallel. The ARM architecture employs a two-stage memory translation scheme to allow hypervisors to maintain separation between multiple guests.
Two-stage translation allows the hypervisor to control a guest’s view of memory in the same way the operating system controls applications’ views of memory. This includes restricting the physical memory a guest can access, restricting the memory-mapped system resources a guest can access, and mapping real physical addresses to intermediate physical addresses.
When running under a hypervisor, the guest operating system will actually be using intermediate physical addresses when it thinks it is using physical addresses. The operating system configures the stage one translation tables and the hypervisor configures the stage two translation tables. Each memory access from applications running in guest VMs will undergo two stages of translation in the MMU. The MMU will first use the stage one tables to convert the virtual address to an intermediate physical address, then use the stage two tables to convert the intermediate physical address to a real physical address as shown below:
Each VM is assigned a Virtual Machine Identifier (VMID). The VMID is used to tag translation lookaside buffer (TLB) entries, to identify which VM each entry belongs to. This tagging allows translations for multiple VMs to be present in the TLBs at the same time. Each VM has its own ASID namespace so the VMID and ASID are combined when tagging TLB entries.
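A rough sketch of the two stages and the combined tag, again as conceptual C rather than anything resembling real hypervisor code; the lookup placeholders and the tag layout are illustrative.

```c
#include <stdint.h>

/* Placeholder walks standing in for the real table reads (identity maps here). */
static uint64_t stage1_lookup(uint64_t va)  { return va;  }  /* guest-owned tables      */
static uint64_t stage2_lookup(uint64_t ipa) { return ipa; }  /* hypervisor-owned tables */

static uint64_t translate_guest_access(uint64_t va)
{
    uint64_t ipa = stage1_lookup(va);   /* stage 1: virtual -> intermediate physical  */
    return stage2_lookup(ipa);          /* stage 2: intermediate physical -> physical */
}

/* TLB entries are tagged with both identifiers, so translations belonging to
 * different VMs (and different tasks within a VM) can coexist in the TLB. */
static uint32_t tlb_tag(uint16_t vmid, uint16_t asid)
{
    return ((uint32_t)vmid << 16) | asid;
}
```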
The stage one and stage two mappings both include attributes, such as type and access permissions. The MMU combines the attributes from the two stages to give a final effective value. The MMU does this by selecting the stage that is more restrictive.
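In effect, an access is only permitted if both stages permit it. A minimal sketch, using an illustrative flag layout rather than the real descriptor encoding:

```c
#include <stdbool.h>

struct perms { bool read, write, exec; };

/* The effective permission is the more restrictive of the two stages:
 * an access is allowed only when stage 1 and stage 2 both allow it. */
static struct perms combine_stages(struct perms s1, struct perms s2)
{
    struct perms effective = {
        .read  = s1.read  && s2.read,
        .write = s1.write && s2.write,
        .exec  = s1.exec  && s2.exec,
    };
    return effective;
}
```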
The ARM architecture defines two physical address spaces: a Secure address space and a Non-secure address space. In theory the Secure and Non-secure physical address spaces are independent of each other and exist in parallel; a system could be designed with two entirely separate memory systems. However, most real systems treat Secure and Non-secure as an attribute for access control. The Normal (Non-secure) world can only access the Non-secure physical address space. The Secure world can access both physical address spaces. On top of these two physical address spaces there are several independent virtual address spaces.
The diagram shows three virtual address spaces:
NS.EL0 and NS.EL1 (Non-secure EL0/EL1)
NS.EL2 (Non-secure EL2)
EL3
Each of these virtual address spaces (translation regimes) is independent, and has its own settings and tables. There are also virtual address spaces for Secure EL0, Secure EL1 and Secure EL2 which are not shown in the diagram. The diagram also shows how stage 2 translation only happens for Non-secure EL0 and Non-secure EL1 in support of virtualization.
Conclusion
The MMU is a key piece of hardware for enabling secure, performant virtualization. Its use to isolate virtual machines is a natural extension of its original purpose of allowing applications to share a limited physical address space.