MCU : TableWalking

If no internal TLB matches the current access, the MMU must find the mapping information elsewhere.

In the SPARC V8 reference MMU, mapping information is placed in tables in main memory and lookup is entirely done by hardware. Some other CPU architectures rely on traps and let the OS be responsible for updating the TLBs.

Some support both.
[Figure: mcu_table]

Pages are arranged in a three-level tree and together map the entire 4GB address space:

  • The 4GB page can be either used alone, or split into 256 pages of 16MB.
  • Each page of 16MB can be independently configured as a contiguous area, or split into 64 pages of 256kB.
  • Each page of 256kB can be independently configured as a contiguous area, or split into 64 pages of 4kB.
  • Each page of 4kB represents the smallest indivisible unit of memory management.

Memory areas cannot overlap.
Some CPU ISAs provide only small pages and/or use other divisions. For example, traditional 32bits x86 also uses 4kB pages, but with a two-level hierarchy: 4GB = 1024*4MB, 4MB = 1024*4kB.
The hierarchical structure is an efficient method for reducing the size of the page tables, particularly when the installed memory is smaller than 4GB. Mapping each 4kB page individually requires at worst (64*64*256+64*256+256)*4 ≈4.06MB per context.
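The worst-case figure can be checked with a few lines of Python. This is only an arithmetic sketch: the entry counts are read off the formula above and the VA index fields used during tablewalking (8, 6 and 6 bits), and the constant and function names are made up for the example.

```python
# Worst-case page table size for one context, with every 4kB page mapped
# individually. All names here are illustrative, not part of any real API.
ENTRY_BYTES = 4      # each PTD/PTE is one 32-bit word
L0_ENTRIES = 256     # indexed by VA(31:24)
L1_ENTRIES = 64      # indexed by VA(23:18)
L2_ENTRIES = 64      # indexed by VA(17:12)

def worst_case_table_bytes():
    l0_words = L0_ENTRIES                             # one top-level table
    l1_words = L0_ENTRIES * L1_ENTRIES                # 256 second-level tables
    l2_words = L0_ENTRIES * L1_ENTRIES * L2_ENTRIES   # 16384 third-level tables
    return (l0_words + l1_words + l2_words) * ENTRY_BYTES

total = worst_case_table_bytes()
print(total, "bytes =", round(total / 2**20, 2), "MB")
```

The context table itself (one word per context) is not included, since it is shared by all contexts.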
All of the 4GB address space corresponds to an entry somewhere. That entry can either map physical memory or indicate that the area is unavailable. If an access is made to an address marked as unmapped, a trap is triggered and the operating system decides whether it is an invalid access due to a software error, or a page that is just waiting to be filled from disk, zeroed…
When starting a program under Linux (or Windows,…), the operating system does not have to read the whole binary from disk before starting execution. The OS just has to create a new process, then start from there. Memory is loaded and allocated on demand, driven only by CPU accesses: code fetches, data reads or writes.
When the mapping tables are updated in memory, some TLBs may need to be flushed in order to preserve coherency. This must be done by the operating system, typically using special supervisor-only instructions. In the SPARC case, it is done with STA instructions to the “MMU flush/probe” address space.

Tables

PTD sizes

The MMU accesses many tables in RAM, which contain either pointers to next-level tables (“PTD”) or page entries (“PTE”).
PTE and PTD entries can be shared between contexts: the kernel, kernel data and IO mappings use the same addresses in all contexts.

PTE, PTD
To read or write a byte, if no TLB matches, up to 6 memory accesses may be needed:

1) Read ContextTablePointerRegister[ContextRegister] → L0 PTD table base pointer
2) Read L0Table[VA(31:24)] → L1 PTD table base pointer
3) Read L1Table[VA(23:18)] → L2 PTD table base pointer
4) Read L2Table[VA(17:12)] → L3 PTE entry, which indicates the physical address and access conditions for this page. This entry is copied into a TLB.
5) Update the PTE entry, if needed: Modified or Referenced bit.
6) And, finally, do the actual read or write operation, possibly as part of a cache fill burst transfer.

Larger pages need fewer accesses and correspond to L0, L1 or L2 PTEs.
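The steps above can be sketched as a small Python model. This is not TEMLIB code: `memory` is a hypothetical dict of 32-bit words keyed by physical address, and every function name is invented for the example. The bit layouts (pointer in bits 31:2 giving physical address bits 35:6, page number in bits 31:8) are the ones I understand the SPARC V8 reference MMU to define. Steps 5 and 6 (R/M update and the access itself) are left out.

```python
# Toy model of tablewalk steps 1 to 4. Illustrative only.
ET_PTD, ET_PTE = 1, 2

# (shift, width) of the virtual address index used at each table level
LEVELS = [(24, 8), (18, 6), (12, 6)]

def table_base(pointer_word):
    """Bits 31:2 of a PTD (or of the context table pointer register)
    hold bits 35:6 of the physical address of the next table."""
    return (pointer_word >> 2) << 6

def make_ptd(table_addr):
    return ((table_addr >> 6) << 2) | ET_PTD

def make_pte(page_addr):
    # Bits 31:8 hold the physical page number; C, M, R and ACC are left at 0.
    return ((page_addr >> 12) << 8) | ET_PTE

def table_walk(memory, ctpr, context, va):
    """Return (physical_address, pte, memory_reads).
    Raises LookupError when the walk ends on anything but a PTE."""
    entry = memory.get(table_base(ctpr) + 4 * context, 0)  # step 1
    reads = 1
    page_bits = 32               # a PTE found here would map the whole 4GB
    for shift, width in LEVELS:  # steps 2, 3 and 4
        if entry & 3 != ET_PTD:
            break                # PTE (larger page) or invalid: stop early
        index = (va >> shift) & ((1 << width) - 1)
        entry = memory.get(table_base(entry) + 4 * index, 0)
        reads += 1
        page_bits = shift
    if entry & 3 != ET_PTE:
        raise LookupError("page fault: no valid PTE for this address")
    offset_mask = (1 << page_bits) - 1
    page_base = ((entry >> 8) << 12) & ~offset_mask
    return page_base | (va & offset_mask), entry, reads

memory = {
    0x1000: make_ptd(0x2000),                 # context table, context 0
    0x2000 + 4 * 0x12: make_ptd(0x3000),      # L0 table, VA(31:24) = 0x12
    0x3000 + 4 * 13: make_ptd(0x4000),        # L1 table, VA(23:18) = 13
    0x4000 + 4 * 5: make_pte(0xABCDE000),     # L2 table, VA(17:12) = 5
    0x2000 + 4 * 0x40: make_pte(0x80000000),  # 16MB page at VA 0x40000000
}
CTPR = (0x1000 >> 6) << 2

small = table_walk(memory, CTPR, 0, 0x12345678)
large = table_walk(memory, CTPR, 0, 0x40123456)
print(hex(small[0]), small[2])   # 4kB page: four table reads
print(hex(large[0]), large[2])   # 16MB page: two table reads
```

The second lookup stops as soon as it meets a PTE in the L0 table, which is why larger pages cost fewer accesses.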

All MMU tables must be aligned, so, depending on page size and number of entries (256 L0 entries, 64 L1 & L2 entries, i.e. 1024-byte and 256-byte tables), some bits can be ignored in the PTE and PTD (grayed areas in the figure above).
To speed up this “table walking” operation a little, intermediate values are kept by our MMU:
– L0 PTP pointer. Keeping that pointer saves one memory access (the first one above). This pointer must be reloaded after the context register or table pointer register is changed or when the context table is modified.
– L2 PTP for Instructions and L2 PTP for Data. If successive accesses fall within the same 256kB region of virtual memory, only one tablewalking access is necessary when a TLB miss occurs (accesses (1), (2) and (3) above are skipped). Like the L0 PTP cache, these pointers must be discarded when the MMU mapping registers are altered or if page tables are modified.
There is no dedicated instruction for flushing these intermediate pointers: the TLB flush instructions are what indicate that the memory mapping has been modified.
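As a rough illustration of the L2 PTP idea, the sketch below models a single cached pointer tagged with VA(31:18). The class and function names are invented, and the actual hardware is of course a few registers and a comparator rather than Python objects. No context number is kept in the tag because, as stated above, the pointer is discarded whenever the mapping registers change.

```python
class L2PointerCache:
    """Remembers the base of the last L2 table used; the tag is VA(31:18)."""

    def __init__(self):
        self.tag = None
        self.l2_table_base = None

    def lookup(self, va):
        return self.l2_table_base if self.tag == va >> 18 else None

    def fill(self, va, l2_table_base):
        self.tag, self.l2_table_base = va >> 18, l2_table_base

    def flush(self):
        # Called on TLB flush or when the MMU mapping registers are written.
        self.tag = self.l2_table_base = None

def walk_reads(cache, va, l2_table_base_from_full_walk):
    """Table reads needed to reach a 4kB PTE on a TLB miss: 1 when the
    cached pointer matches, otherwise 4 (3 if the L0 pointer is cached too)."""
    if cache.lookup(va) is not None:
        return 1
    cache.fill(va, l2_table_base_from_full_walk)
    return 4

cache = L2PointerCache()
print(walk_reads(cache, 0x12345678, 0x4000))  # first miss: full walk
print(walk_reads(cache, 0x12340000, 0x4000))  # same 256kB region: one read
print(walk_reads(cache, 0x12380000, 0x5000))  # next 256kB region: full walk
```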

Compared to “real” designs which have tens of fully associative TLBs, our FPGA implementation is quite TLB-starved, because fast CAMs in FPGAs use a lot of resources.
Caching intermediate PTDs is therefore very important as we get TLB misses often.
(As a comparison: TI SuperSparc2 keeps 1 instruction and 4 data L2 PTDs)

PTE and PTD

Each PTE or PTD is made of several fields. 32bits SPARC uses 32bits entries, as does x86, for example.

ET

Entry type.

00 Invalid
01 PTD
10 PTE
11 Reserved (or little-endian mode on some CPUs, not ours)

Invalid entries, when accessed, trigger a page fault exception that indicates that the memory is not mapped. For example, the program or data is still on disk and not yet loaded into memory.

ACC

Access conditions. See previous article. Copied into the TLB.
The tablewalking state machine does not check that the access is authorised. This is done after the entry has been copied into a TLB.

Referenced

The “Referenced” (aka ‘accessed’) bit indicates whether a page has been accessed, either as code fetch, data read or data write. An operating system can use this information for many purposes, for example when using memory as a disk cache, never accessed areas should be discarded first.
This bit provides a conservative evaluation:

  • R=0: The page has never been accessed
  • R=1: The page may have been accessed for something useful. Or maybe not.

This bit is set by the MMU and is cleared by the operating system.
Modern CPUs do all sorts of speculative accesses, the most common type being instruction prefetches. The prefetch unit tries to guess the likely outcome of the program execution, but it can be wrong. In that case, pages may be needlessly accessed and the reference bit may be set. (AFAIK, no CPU ISA guarantees that page ‘reference’ bits are exact. The PowerPC documentation clearly states that speculative prefetches or data accesses may set this bit.)
Even a simple CPU like ours can do a few useless instruction fetches, for example annulled instructions, or fetches just before exceptions.

The MMU automatically updates the R bit after tablewalking. There is no R bit in the TLBs because being in a TLB implies that the R bit is already set.

Modified

The modified (aka “dirty” or “changed”) bit, which is copied in the TLBs, indicates that a write access has occurred in the page. This bit is set by the MMU and is cleared by the operating system.
Like the “referenced” bit, modified bits are a bit conservative. If the CPU writes 00 while the memory content is already 00, the modified bit will be set anyway: the MMU manages addresses, not data…
Modified bits have many purposes. For example a disk cache needs to write back only modified areas.

This is different from making some memory read-only and then triggering a trap on the first write access, as is done for lazy copying (copy-on-write, CoW).

Cacheable

The Cacheable/non cacheable bit in the page tables provides very basic cache management:
I/O ports should not be cached. Simple.

The SPARCv8 standard is quite poor at managing memory properties. For example, there is no way to define an area as either cache write-back or write-through. Multiprocessor systems can also use special indications to distinguish local or shared pages (see PowerPC’s “WIMG” bits). Some of these properties correspond to hardware resources and physical address ranges (for example, a framebuffer should be configured as write-through, some peripherals are little endian), so they are not really needed in the virtually addressed page tables (see x86’s “MTRR” registers).
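The fields described in this section can be pulled out of a 32-bit PTE as sketched below. The bit positions are not given in the text (they were presumably in the figure); the ones used here are those I understand the SPARC V8 reference MMU to define: PPN in bits 31:8, C in bit 7, M in bit 6, R in bit 5, ACC in bits 4:2 and ET in bits 1:0. The type and function names are made up for the example.

```python
from typing import NamedTuple

ET_NAMES = {0: "Invalid", 1: "PTD", 2: "PTE", 3: "Reserved"}

class Pte(NamedTuple):
    ppn: int           # physical page number (physical address bits 35:12)
    cacheable: bool
    modified: bool
    referenced: bool
    acc: int           # access conditions, checked once the entry is in a TLB
    et: str

def decode_pte(word):
    return Pte(
        ppn=(word >> 8) & 0xFFFFFF,
        cacheable=bool(word & 0x80),
        modified=bool(word & 0x40),
        referenced=bool(word & 0x20),
        acc=(word >> 2) & 0x7,
        et=ET_NAMES[word & 0x3],
    )

def pte_after_access(word, is_write):
    """Value written back after a tablewalk: R is always set, M only on writes.
    Clearing either bit again is left to the operating system."""
    return word | 0x20 | (0x40 if is_write else 0)

word = 0x0ABCDE8E   # arbitrary example: cacheable PTE, ACC=3, R and M clear
print(decode_pte(word))
print(hex(pte_after_access(word, is_write=False)))
print(hex(pte_after_access(word, is_write=True)))
```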

Software and hardware tablewalking

The first SPARC MMUs, on early Sun4 computers, used software-based management of TLBs without HW tablewalking. There were also primitive MMUs in SPARC-based Solbourne computers. The embedded SparcV8e variant also features optional software tablewalking and software management and locking of TLBs (for time determinism). It is often also possible to preload TLBs, but this is more a chip test feature than an actual operating mode.

– The TLBs of 64bits SPARCv9 CPUs are software managed, as are those of MIPS and PA-RISC CPUs.
– x86, MC68K, ARMs use hardware tablewalking.
– PowerPCs can use both and use hash tables for hardware TW. Broken by design.

  • Using HW tablewalk enables faster TLB updates, allowing a reduced number of TLBs. Using SW tablewalk enables more flexible page tables, configurable page sizes and direct control of the TLBs by the operating system.
  • Software tablewalk requires keeping the TLB contents and/or the instructions accessing the TLBs identical across CPUs if one wants to limit porting efforts. Hardware tablewalk requires documenting and standardizing the page table format, which should be less dependent on the CPU design.
  • Hardware page tables require enough area to map the whole memory, whereas software management permits on-demand generation of TLB entries from OS structures (which may have any format, possibly architecture-agnostic).

Some CPUs provide both.
The number of TLBs has a critical impact on CPU performance. Many CPUs have multi-layer TLB caches, with a few fast fully associative ITLBs and DTLBs and a large but slower shared instruction and data TLB cache.

MMU tablewalking picks a few 32bits values a few kilobytes apart, in memory areas unrelated to the currently executing software. For our CPU, like many others, MMU page tables cannot be cached in the first level cache. MMU accesses are slow.

6 thoughts on “MCU : TableWalking”

  1. Interesting update, I haven’t heard anyone mention sparc v6 anywhere was there ever a specification for that or was it more of an internal research thing?

    I’m in the process of upgrading my T2000 at the moment so my gentoo stages are non accesible (hopefully up by next week and hopefully updated Gentoo stage3 images after that)

    Also I am running into crashes on real hardware which seems to be a bug in python which I need to get sorted out it only occurs on real V8 hardware an not v9 though I don’t think it occurs on qemu anymore either since it pretends to be a 64bit processor even when running v8 code from what I gather at least register wise.

    Chase

    • Hello Chase!

      I have removed the V6.
      The first Sparcs, “MB86900”, “B5000″… were used before the “Reference MMU” was defined, and the MMU is merely an annex in the standard.
      I have the 7C601 databook, and nowhere the version (e.g. V7) is indicated, there is just ‘SPARC architecture’. other datasheets from early SPARCs don’t mention any version either.

      QEMU do not imitate exactly the behavior of actual hardware, there is no cache, no TLBs, no issues with memory coherency and probably a few bugs remaining (NextSTEP for Sparc still do not work !)
      AFAIU, QEMU when running 32bits (Sun4m) emulation, use a 32bits CPU model (MicroSparcII as default) with 32bits registers (include/exec/cpu_defs.h : target_ulong, target-sparc/cpu.h : CPUSPARCState)
      There are some corner cases which are handled differently on QEMU and real hardware, for example when some registers are updated (Y, PSR…), the two following instructions should not depend on the register value : Real CPUs need the delay, QEMU apply changes immediately.

      I will try your Gentoo build soon ! I am trying to finish the R4 version, it was far more difficult than expected.

      • I’ve started looking into how to generate the Memory controller for my FPGA board as well.. im not entirely sure where to hook it into your code though.

        It will take me a week or two to get the gentoo stages updated … I had some further delays (I forgot to get drive caddies for my new drives).

        Thre are a couple guys hanging out in the #gentoo-sparc IRC chatroom that have sparc32 (some sun4m and sun4d) hardware also that will enjoy you work as well.

        • I may get the same board as you, to try Alteras… It is also probably more popular than the Xilinx SP605 board.

          I can run Debian Etch 4.0 with Gnome, the latest that supported Sparc32, but it is painfully slow. Of course a 50MHz CPU is not fast, but another major issue is that with only 128MB of RAM, there is no disk cache, and lots of swapping. The 512MB of the Terasic board should be far better.

          Anyway, you may find some interest in the upcoming R4 enhancements…

  2. Here’s a link to the LPDDR2 setup tutorial its on a 32x port as well… unlike someboards that only access ram on a 16x port. Do you think the 4Mb Sram would be of any use for cache or as video memory it has a 16x wide bus?

    https://www.youtube.com/watch?v=fYmAIOwyO3o

    Also some of the funcality is on overlapping pins… mainly the HSMC and Arduino expansion but I think some other things also.

    Someone has written a gameboy emulator for it https://github.com/geky/gb so that should give you some ideas as to how to get video output working though I haven’t tested anything at all on my board lately. I think the guy making the gcvideo boards has the i2c init code working directly as well. I imagine it would be rather difficult to use the LPDDR2 without the hardcores … I think you can distribute the netlists for those but I’m not sure.

    Also have you seen the Venus822 chips? You can only buy them as part of a GPS board but still rather interesting its a Leon3 Sparc in a rather arduino-ish package.

    • – I need to install Quartus to better understand that video. I have not used Alteras for ages.
      Interfacing dynamic RAM is always a mess, there are high frequency clocks and PLLs and many timing parameters. Having sample designs is very helpful. There are a few Xilinx copyrighted files for the RAM controller in TEMLIB because it is more convenient that re-generating them through “wizards”, and because I had to modify some of them to reuse the RAM clocks for the CPU. Maybe I should instead only distribute diff files.
      To interface Altera’s DDR2 controller macro, a “plomb to Avalon” bridge is needed, remplacing the “plomb to MIG” (plomb_mig.vhd file) for interfacing Xilinx’s DDR controller.

      – The cache should be based on the FPGA internal RAM blocks to have high frequency, wide busses and short latency. There are enough of it for 16K L1 caches and a future 128K L2 cache.

      – Currently, the only supported resolution is 1024x768x256c, which needs a 768KB framebuffer.
      The first use I could imagine for the external static RAM for the SparcStation project would be for storing OpenBIOS. The binary size is now 414KB, maybe it could be reduced as I’m not sure that all the data sections linked together need to be copied. There is already an interface for a 16bits-wide FLASH (aflash.vhd) which would work with minor changes with the static RAM.

      – The Terasic’s board video interface seems straightforward : 24bits colours and synchro. It is a bit simpler than the Xilinx board which uses a double data rate DVI encoder. Programming the chip can be done in software by bit-banging the I²C bus. It is done (crudely) in the patched OpenBIOS (video_common.c) for programming a Chrontel CH7301 chip. Using software configuration saves resources in the FPGA, it is more flexible and delaying slightly the boot sequence does not matter…

      – There are a few LEON chips used for ground applications, even if it is famous for its radiation-hardened space variants. Nowadays, ARM is everywhere and it is difficult to compete against that. LEON started as an ESA project for replacing the obsolete ERC32 chips which were derived from the commercial 7C601 SPARC chipset which was used in early SparcStations. Despite all that, they did not try to make LEON compatible with exisiting Sparcs (unlike TEMLIB’s CPU which tries to imitate a MicroSparcII), there are a few supervisor access registers which are different, it can be seen in the Linux kernel.
