If no internal TLB matches the current access, the MMU must find the mapping information elsewhere.
In the SPARC V8 reference MMU, mapping information is placed in tables in main memory, and the lookup is done entirely by hardware. Some other CPU architectures rely on traps and let the OS be responsible for updating the TLBs.
These pages are arranged in a three-level tree and together map the entire 4GB address space:
- The 4GB page can either be used alone, or split into 256 pages of 16MB.
- Each page of 16MB can be independently configured as a contiguous area, or split into 64 pages of 256kB.
- Each page of 256kB can be independently configured as a contiguous area, or split into 64 pages of 4kB.
- Each page of 4kB represents the smallest indivisible unit of memory management.
Memory areas cannot overlap.
Some CPU ISAs provide only small pages and/or use other divisions; for example, traditional 32-bit x86 also uses 4kB pages but with a two-level hierarchy: 4GB = 1024 × 4MB, 4MB = 1024 × 4kB.
The hierarchical structure is an efficient way of reducing the size of the page tables, particularly when the installed memory is smaller than 4GB. Mapping each 4kB page individually requires at worst (256 + 256×64 + 256×64×64) × 4 bytes ≈ 4.06MB per context.
All of the 4GB address space corresponds to an entry somewhere. That entry can either map physical memory or indicate that the area is unavailable. If an access is made to an address marked as not mapped, a trap is triggered and the operating system decides whether it is an invalid access due to a software error, or whether the area is just waiting to be filled from disk, zeroed…
When starting a program under Linux (or Windows,…), the operating system does not have to read the whole binary from disk before starting execution. The OS just has to create a new process and start from there. Memory is loaded and allocated on demand, driven only by CPU accesses: code fetches, data reads or writes.
When the mapping tables are updated in memory, some TLBs may need to be flushed to preserve coherency. This must be done by the operating system, typically using special supervisor-only instructions. In the SPARC case, it is done with STA instructions to the “MMU flush/probe” address space.
The MMU accesses many tables in RAM, which contain either pointers to next-level tables (“PTD”) or page entries (“PTE”).
PTE and PTD entries can be shared between contexts: the kernel, kernel data and I/O mappings use the same addresses in all contexts.
1) Read ContextTablePointerRegister[ContextRegister] → L0 PTD table base pointer.
2) Read L0Table[VA(31:24)] → L1 PTD table base pointer.
3) Read L1Table[VA(23:18)] → L2 PTD table base pointer.
4) Read L2Table[VA(17:12)] → L3 PTE entry, which indicates the physical address and access conditions for this page. This entry is copied into a TLB.
5) Update the PTE entry, if needed: Modified or Referenced bit.
6) Finally, do the actual read or write operation, possibly as part of a cache fill burst transfer.
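The walk above can be sketched in C against a small simulated memory. The entry-type encoding (PTD = 1, PTE = 2 in bits 1:0) follows the SPARC V8 spec; the word-indexed memory and the helper names are simplifications of ours, not the real design:

```c
#include <stdint.h>

#define ET_PTD 1u   /* entry type, bits 1:0: pointer to next-level table */
#define ET_PTE 2u   /* entry type, bits 1:0: page table entry            */

static uint32_t mem[256];   /* fake physical memory, word-addressed */
static uint32_t rd(uint32_t widx) { return mem[widx]; }

/* Walk for virtual address 'va' in context 'ctx'; returns PTE, 0 on fault.
   In the real PTD the pointer field (bits 31:2) is scaled to a 36-bit
   physical table address; here it is used directly as a word index. */
static uint32_t table_walk(uint32_t ctx_table, uint32_t ctx, uint32_t va) {
    uint32_t idx[3] = {
        (va >> 24) & 0xFFu,   /* VA(31:24): 256 L0 entries */
        (va >> 18) & 0x3Fu,   /* VA(23:18): 64 L1 entries  */
        (va >> 12) & 0x3Fu,   /* VA(17:12): 64 L2 entries  */
    };
    uint32_t e = rd(ctx_table + ctx);          /* step 1: context table */
    for (int level = 0; level < 3; level++) {
        if ((e & 3u) == ET_PTE) return e;      /* large page at this level */
        if ((e & 3u) != ET_PTD) return 0;      /* invalid: page fault trap */
        e = rd((e >> 2) + idx[level]);         /* steps 2, 3, 4 */
    }
    return (e & 3u) == ET_PTE ? e : 0;         /* L3 PTE for a 4kB page */
}

/* Example: map VA 0x01234567 (indices 1, 8, 52) to PPN 0xABCD. */
static uint32_t demo(void) {
    mem[0]       = (8u  << 2) | ET_PTD;       /* context 0 -> L0 table at word 8 */
    mem[8 + 1]   = (16u << 2) | ET_PTD;       /* L0[1] -> L1 table at word 16    */
    mem[16 + 8]  = (32u << 2) | ET_PTD;       /* L1[8] -> L2 table at word 32    */
    mem[32 + 52] = (0xABCDu << 8) | ET_PTE;   /* L2[52]: PTE for one 4kB page    */
    return table_walk(0, 0, 0x01234567u);     /* returns 0x00ABCD02 */
}
```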
Larger pages need fewer accesses and correspond to L0, L1 or L2 PTEs.
All MMU tables must be aligned, so, depending on page size and number of entries (the 256-entry L0 table occupies 1024 bytes; the 64-entry L1 and L2 tables occupy 256 bytes each), some low-order bits can be ignored in the PTE and PTD (grayed areas in the figure above).
To speed up this “table walking” operation a bit, our MMU keeps intermediate values:
– The L0 PTP pointer. Keeping this pointer saves one memory access (the first one above). It must be reloaded after the context register or table pointer register is changed, or when the context table is modified.
– The L2 PTP for instructions and the L2 PTP for data. If successive accesses fall within the same 256kB region of virtual memory, only one tablewalking access is necessary when a TLB miss occurs (accesses 1, 2 and 3 above are skipped). Like the L0 PTP cache, these pointers must be discarded when the MMU mapping registers are altered or when the page tables are modified.
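The L2 PTP shortcut amounts to a one-entry cache tagged by VA(31:18), i.e. by the 256kB region. A hypothetical sketch (the structure and names are illustrative, not the actual hardware registers):

```c
#include <stdint.h>

/* One-entry cache for the last L2 table pointer used by a walk. */
typedef struct {
    uint32_t l2_ptp;   /* saved L2 table pointer                   */
    uint32_t tag;      /* VA(31:18) of the walk that loaded it     */
    int      valid;    /* cleared on TLB flush or table update     */
} ptp_cache;

/* Returns 1 and the saved pointer if 'va' falls in the same 256kB
   region as the previous walk: only access 4 is then needed. */
static int ptp_hit(const ptp_cache *c, uint32_t va, uint32_t *l2_ptp) {
    if (c->valid && c->tag == (va >> 18)) {
        *l2_ptp = c->l2_ptp;
        return 1;
    }
    return 0;   /* miss: a full table walk reloads the cache */
}
```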
There is no dedicated instruction for flushing these intermediate pointers; it is the TLB flush instructions that indicate that the memory mapping has been modified.
Compared to “real” designs, which have tens of fully associative TLBs, our FPGA implementation is quite TLB-starved, because fast CAMs in FPGAs use a lot of resources.
Caching intermediate PTDs is therefore very important, as TLB misses are frequent.
(As a comparison: TI SuperSparc2 keeps 1 instruction and 4 data L2 PTDs)
PTE and PTD
Each PTE or PTD is made of several fields. 32-bit SPARC uses 32-bit entries, as does, for example, x86.
Bit 11: Reserved (or little-endian mode on some CPUs, not ours)
Invalid entries, when accessed, trigger a page fault exception indicating that the memory is not mapped. For example, the program or data is still on disk and not yet loaded into memory.
Access conditions. See previous article. Copied into the TLB.
The tablewalking state machine does not check that the access is authorised. This is done after the entry has been copied into a TLB.
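For reference, the PTE fields discussed here can be extracted as follows. The accessor names are ours, but the bit positions are those of the SPARC V8 reference MMU: ET in bits 1:0, ACC in 4:2, R in 5, M in 6, C in 7, PPN in bits 31:8 (physical address bits 35:12):

```c
#include <stdint.h>

static inline uint32_t pte_et(uint32_t e)  { return e & 3u; }         /* entry type: 0=invalid, 1=PTD, 2=PTE */
static inline uint32_t pte_acc(uint32_t e) { return (e >> 2) & 7u; }  /* access permissions */
static inline uint32_t pte_r(uint32_t e)   { return (e >> 5) & 1u; }  /* referenced bit */
static inline uint32_t pte_m(uint32_t e)   { return (e >> 6) & 1u; }  /* modified bit   */
static inline uint32_t pte_c(uint32_t e)   { return (e >> 7) & 1u; }  /* cacheable bit  */

/* 4kB page: physical address = PPN (bits 31:8) concatenated with the
   12-bit page offset of the virtual address. */
static inline uint64_t pte_pa(uint32_t e, uint32_t va) {
    return ((uint64_t)(e >> 8) << 12) | (va & 0xFFFu);
}
```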
The “Referenced” (aka “accessed”) bit indicates whether a page has been accessed, either as a code fetch, a data read or a data write. An operating system can use this information for many purposes; for example, when using memory as a disk cache, never-accessed areas should be discarded first.
This bit provides a conservative evaluation:
- R=0: The page has never been accessed.
- R=1: The page may have been accessed for something useful. Or maybe not.
This bit is set by the MMU and is cleared by the operating system.
Modern CPUs do all sorts of speculative accesses, the most common type being instruction prefetches. The prefetch unit tries to guess the likely outcome of the program execution, but it can be wrong. In that case, pages may be needlessly accessed and the referenced bit may be set. (AFAIK, no CPU ISA guarantees that pages’ “referenced” bits are exact. PowerPC documentation clearly states that speculative prefetches or data accesses may set this bit.)
Even a simple CPU like ours can do a few useless instruction fetches, for example annulled instructions, or just before exceptions.
The MMU automatically updates the R bit after tablewalking. There is no R bit in the TLBs because being in a TLB implies that the R bit is already set.
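One common way an OS exploits the R bit is a “second chance” scan when choosing which page to evict. This is a hypothetical sketch of the policy, not how any particular OS does it; after clearing R bits in the tables, the OS must also flush the relevant TLBs so the MMU walks the tables again:

```c
#include <stdint.h>

#define PTE_R (1u << 5)   /* referenced bit, SPARC V8 PTE layout */

/* Pick the first page whose R bit is still clear, clearing the R bits
   we pass over so they have to be earned again by real accesses. */
static int pick_victim(uint32_t *pte, int n) {
    for (int i = 0; i < n; i++) {
        if (!(pte[i] & PTE_R))
            return i;        /* not referenced since the last scan */
        pte[i] &= ~PTE_R;    /* clear R; the MMU will set it again */
    }
    return 0;                /* everything was referenced: fall back to first */
}
```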
The modified (aka “dirty” or “changed”) bit, which is copied into the TLBs, indicates that a write access has occurred in the page. This bit is set by the MMU and is cleared by the operating system.
Like the “referenced” bit, modified bits are a bit conservative: if the CPU writes 00 while the memory content is already 00, the modified bit will be set anyway. The MMU manages addresses, not data…
Modified bits have many purposes. For example, a disk cache needs to write back only modified areas.
This is different from making some memory read-only and then triggering a trap on the first write access, as used for lazy copying (CoW).
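The disk-cache use of the M bit can be sketched as follows (the flush helper and its disk stub are illustrative assumptions, not real OS code):

```c
#include <stdint.h>

#define PTE_M (1u << 6)   /* modified ("dirty") bit, SPARC V8 PTE layout */

/* Stub: real code would write the page's contents back to disk. */
static void write_back(int page) { (void)page; }

/* Flush a page-based disk cache: only pages whose M bit is set need a
   write-back; the bit is then cleared by the OS (the MMU will set it
   again on the next write to the page). */
static int flush_dirty(uint32_t *pte, int n) {
    int written = 0;
    for (int i = 0; i < n; i++) {
        if (pte[i] & PTE_M) {
            write_back(i);
            pte[i] &= ~PTE_M;
            written++;
        }
    }
    return written;   /* number of pages actually written */
}
```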
The Cacheable/non-cacheable bit in the page tables provides very basic cache management:
I/O ports should not be cached. Simple.
The SPARCv8 standard is quite poor at managing memory properties; for example, there is no way to define an area as either cache write-back or write-through. Multiprocessor systems can also use special indications to distinguish local or shared pages (see PowerPC’s “WIMG” bits). Some of these properties correspond to hardware resources and physical address ranges (for example, a framebuffer should be configured as write-through, and some peripherals are little-endian), so they are not really needed in the virtually addressed page tables (see x86’s “MTRR” registers).
Software and hardware tablewalking
The first SPARC MMUs, on early Sun4 computers, used software-based management of TLBs without HW tablewalking. There were also primitive MMUs in SPARC-based Solbourne computers. The embedded SparcV8e variant also features optional software tablewalking, plus software management and locking of TLBs (for timing determinism). It is often also possible to preload TLBs, but this is more a chip test feature than an actual operating mode.
– The 64-bit SPARCv9 CPUs’ TLBs are software-managed, as are those of MIPS and PA-RISC CPUs.
– x86, MC68K and ARM use hardware tablewalking.
– PowerPCs can use both and use hash tables for hardware TW. Broken by design.
- Using HW tablewalk enables faster TLB updates, allowing a reduced number of TLBs. Using SW tablewalk enables more flexible page tables, configurable page sizes and direct control of the TLBs by the operating system.
- Software tablewalk requires keeping the TLB contents and/or the instructions accessing the TLBs identical if one wants to limit porting efforts. Hardware tablewalk requires documenting and standardizing the page table format, which should be less dependent on the CPU design.
- Hardware page tables require enough space to map the whole memory, whereas software management permits on-demand generation of TLB entries from OS structures (which may have any format, possibly architecture-agnostic).
Some CPUs provide both.
The number of TLBs has a critical impact on CPU performance; many CPUs have multi-level TLB hierarchies, with a few fast fully associative ITLBs and DTLBs backed by a large but slower TLB cache shared between instructions and data.
MMU tablewalking picks up a few 32-bit values, a few kilobytes apart, in memory areas unrelated to the currently executing software. For our CPU, like many others, MMU page tables cannot be cached in the first-level cache. MMU accesses are slow.