Counting Cache Hits and Misses on an ARM Cortex-M33
Introduction
The Instruction Cache (ICACHE) on the STM32H5 is an 8 KB cache memory positioned between the ARM Cortex-M33 CPU and the MCU Bus Matrix. It connects to the Cortex-M33 via the C-AHB (Code) Bus and features two master ports: M1 (128-bit) and M2 (32-bit). The M1 port leads to a multiplexer that distributes access between internal Flash memory and SRAM (through the Bus Matrix), while the M2 port interfaces with the external memory controllers (OCTOSPI & FSMC) via the Bus Matrix.
This ICACHE is a 2-way set-associative cache with 16-byte cache lines, organized as 256 sets of two cache lines each (256 x 2 x 16 bytes = 8 KB). It employs Hit-Under-Miss and Critical-Word-First refill strategies and includes a remapping feature that enables caching of up to four external memory regions by aliasing their addresses. The cache also supports a Direct-Mapped mode, but its default state is n-way associative.
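To make the geometry concrete, here is a minimal sketch of how an address splits into offset, set index and tag for the figures above. It assumes the textbook set-associative mapping (low address bits select the set) and ignores the remapping feature; the function names are our own, and the exact bit usage should be checked against the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry taken from the text: 2 ways, 16-byte lines, 256 sets. */
#define LINE_BYTES   16u
#define NUM_SETS     256u
#define NUM_WAYS     2u
#define OFFSET_BITS  4u   /* log2(16)  */
#define INDEX_BITS   8u   /* log2(256) */

/* Sanity check: the three figures must multiply out to 8 KB. */
static_assert(LINE_BYTES * NUM_SETS * NUM_WAYS == 8192u, "geometry must total 8 KB");

/* Byte position inside the 16-byte line. */
static inline uint32_t line_offset(uint32_t addr)
{
    return addr & (LINE_BYTES - 1u);
}

/* Which of the 256 sets the address falls into (assumed: low bits index). */
static inline uint32_t set_index(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & (NUM_SETS - 1u);
}

/* Remaining upper bits, compared against the tags stored in both ways. */
static inline uint32_t address_tag(uint32_t addr)
{
    return addr >> (OFFSET_BITS + INDEX_BITS);
}
```

Under this mapping, two addresses that are 4 KB apart (for example 0x08000000 and 0x08001000) land in the same set with different tags, so a set can hold at most two such addresses at a time.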
Our objective is to configure the ICACHE peripheral to monitor Hit-and-Miss counters, providing insights into cache utilization. To fully demonstrate ICACHE functionality, we aim to exhaust the 8KB cache capacity, forcing cache line replacements and refills as dictated by the Address TAG mechanism.
Initializing & Configuring ICACHE
Start by disabling the ICACHE. Then, invalidate all cache lines and wait for the invalidation process to finish. Next, enable the Hit and Miss monitors and reset their counters. After that, enable the ICACHE peripheral. At this point, we are ready to proceed.
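The sequence above can be sketched as follows. This is not the actual driver: the register struct and the bit positions are our own definitions, written from memory of the CMSIS device header, so verify them against the reference manual before use. Passing the register block as a pointer lets the same function run against a fake block on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of the first ICACHE registers (layout assumed). */
typedef struct {
    volatile uint32_t CR;    /* control              */
    volatile uint32_t SR;    /* status               */
    volatile uint32_t IER;   /* interrupt enable     */
    volatile uint32_t FCR;   /* flag clear           */
    volatile uint32_t HMONR; /* hit monitor counter  */
    volatile uint32_t MMONR; /* miss monitor counter */
} icache_regs_t;

/* Bit positions: assumptions to be checked against the device header. */
#define CR_EN        (1u << 0)
#define CR_CACHEINV  (1u << 1)
#define CR_HITMEN    (1u << 16)
#define CR_MISSMEN   (1u << 17)
#define CR_HITMRST   (1u << 18)
#define CR_MISSMRST  (1u << 19)
#define SR_BSYENDF   (1u << 1)
#define FCR_CBSYENDF (1u << 1)

static inline void icache_init_with_monitors(icache_regs_t *ic)
{
    assert(ic != 0);

    ic->CR &= ~CR_EN;                        /* 1. disable the cache           */
    ic->FCR = FCR_CBSYENDF;                  /*    drop any stale "done" flag  */
    ic->CR |= CR_CACHEINV;                   /* 2. invalidate all cache lines  */
    while ((ic->SR & SR_BSYENDF) == 0u) { }  /* 3. wait for invalidation end   */
    ic->FCR = FCR_CBSYENDF;                  /*    clear the flag again        */

    ic->CR |= CR_HITMEN | CR_MISSMEN;        /* 4. enable hit/miss monitors    */
    ic->CR |= CR_HITMRST | CR_MISSMRST;      /* 5. hold both counters in reset */
    ic->CR &= ~(CR_HITMRST | CR_MISSMRST);   /*    release so they can count   */

    ic->CR |= CR_EN;                         /* 6. enable the cache            */
}
```

On the target, the function would be called with the ICACHE base address cast to `icache_regs_t *`. On a host, the fake block must have the end-of-invalidation flag preset, otherwise the wait loop never exits.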
What about the Data Cache?
The Data Cache (DCACHE) is a 4KB cache memory with a single 32-bit Master Port connected to external memory interfaces, including OCTOSPI and FSMC. The S-Bus (System) of the Cortex-M33 CPU feeds into a multiplexer, which distributes outputs to three SRAM regions and one output to the DCACHE. The DCACHE is designed to cache only external memories, as caching SRAM regions close to the CPU is unnecessary.
However, caching external memories is crucial, as they may require several clock cycles to transfer data. Since external memories are unavailable for this demonstration, this article will focus solely on the ICACHE peripheral.
TIMER Configuration
We use TIM2 to generate an interrupt once per second, which refreshes the Cache Monitor shown on the Serial Terminal. The System Clock is derived from the 64 MHz HSI oscillator divided by 2, giving 32 MHz, which is also the operating frequency of the APB1 Bus. To achieve a 1-second interval, we set the TIM2 prescaler to 3199, which divides the 32 MHz input clock by 3200 and reduces it to 10 kHz. Then, by running the counter for 10,000 cycles, we achieve the desired 1-second period.
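The arithmetic can be checked with a small helper. It relies on the usual timer convention that the prescaler divides by PSC + 1 and the counter overflows after ARR + 1 ticks; the auto-reload value of 9999 is our assumption for "10,000 cycles", and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Counter tick rate after the prescaler: f_tick = f_in / (PSC + 1). */
static inline uint32_t timer_tick_hz(uint32_t input_hz, uint16_t psc)
{
    return input_hz / ((uint32_t)psc + 1u);
}

/* Time between update events in milliseconds: (ARR + 1) ticks per update. */
static inline uint32_t timer_period_ms(uint32_t input_hz, uint16_t psc, uint32_t arr)
{
    uint32_t tick_hz = timer_tick_hz(input_hz, psc);
    assert(tick_hz != 0u);  /* prescaler must not exceed the input clock */
    return (uint32_t)((((uint64_t)arr + 1u) * 1000u) / tick_hz);
}
```

With a 32 MHz input, `timer_tick_hz(32000000u, 3199u)` gives 10,000 Hz and `timer_period_ms(32000000u, 3199u, 9999u)` gives 1000 ms.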
The interrupt is enabled through the DIER register, and the NVIC functions are used to enable the IRQ and set its priority. Within the IRQHandler, the ICACHE_UpdateMonitor() function is called to read the Hit and Miss counter registers, clear the terminal, display an ASCII graphic, and print the register values.
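As a rough illustration of the formatting step, the sketch below turns two counter values into one line of text. It is not the ICACHE_UpdateMonitor() implementation: the function names are hypothetical, the hit ratio is an extra we add for illustration, and the terminal clearing and ASCII graphic are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hit ratio in whole percent; 0 when nothing has been counted yet. */
static inline uint32_t hit_ratio_percent(uint32_t hits, uint32_t misses)
{
    uint64_t total = (uint64_t)hits + misses;
    return total ? (uint32_t)(((uint64_t)hits * 100u) / total) : 0u;
}

/* Writes one monitor line into buf; returns what snprintf returns. */
static inline int format_monitor_line(char *buf, size_t len,
                                      uint32_t hits, uint32_t misses)
{
    assert(buf != NULL && len > 0u);
    return snprintf(buf, len, "hits=%lu misses=%lu ratio=%lu%%",
                    (unsigned long)hits, (unsigned long)misses,
                    (unsigned long)hit_ratio_percent(hits, misses));
}
```

In the interrupt handler, the two arguments would come from the Hit and Miss counter registers, and the resulting buffer would be sent over the serial port.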
CACHE Intensive Application
The ICACHE consists of 512 lines, each 16 bytes, totaling 8192 bytes of cache memory. To force cache lines to be replaced, we must first fill all of them. (Critical-Word-First governs the order in which a line is refilled, not which line gets evicted.) This can be achieved by running a Cache Exhaustion application after initializing the ICACHE peripheral. Its purpose is to deliberately fill the entire Instruction Cache (ICACHE) of the microcontroller and then keep going, causing cache misses. Because an instruction cache is filled by instruction fetches rather than by data writes, the application works by executing more code than the cache can hold, which forces the cache to evict old lines in order to make space for new ones.
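The effect of exceeding the capacity can be previewed on a PC with a small software model. This is a host-side sketch, not the Cache Exhaustion firmware: it assumes least-recently-used replacement within each 2-way set (the hardware's actual policy should be checked in the reference manual), touches each line once per pass, and all names are our own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SETS 256u
#define WAYS 2u
#define LINE 16u

typedef struct {
    uint32_t tag[SETS][WAYS];
    uint8_t  valid[SETS][WAYS];
    uint8_t  victim[SETS];      /* way to evict next (the older one) */
    uint32_t hits, misses;
} cache_model_t;

static inline void model_reset(cache_model_t *c) { memset(c, 0, sizeof *c); }

/* One fetch: count a hit if the tag is present, otherwise refill a way. */
static inline void model_fetch(cache_model_t *c, uint32_t addr)
{
    uint32_t set = (addr / LINE) % SETS;
    uint32_t tag = addr / (LINE * SETS);

    for (uint32_t w = 0; w < WAYS; ++w) {
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->hits++;
            c->victim[set] = (uint8_t)(1u - w);  /* other way is now older */
            return;
        }
    }
    uint32_t v = c->victim[set];
    c->tag[set][v] = tag;                        /* evict and refill */
    c->valid[set][v] = 1u;
    c->victim[set] = (uint8_t)(1u - v);
    c->misses++;
}

/* Walk `bytes` of straight-line code, one fetch per line, `passes` times. */
static inline void model_sweep(cache_model_t *c, uint32_t bytes, uint32_t passes)
{
    assert(c != 0);
    for (uint32_t p = 0; p < passes; ++p)
        for (uint32_t a = 0; a < bytes; a += LINE)
            model_fetch(c, a);
}
```

In this model, two passes over 8192 bytes give 512 misses followed by 512 hits, because the code fits exactly. Two passes over 16384 bytes give 2048 misses and no hits, since every line is evicted before it is fetched again. Real counters will differ because the CPU fetches several words per line.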
I will demonstrate how the Data Cache (DCACHE) manages hits and misses after developing drivers for the OCTOSPI and FSMC peripherals. I don't want to work with HAL drivers because I can't learn anything from using them.