In this article, I’ll explore the interrupt latency, also known as interrupt response time, of an ARM Cortex-A9 under various scenarios — and yes, it’s still on the Xilinx Zynq-7000, since I still have that board on my desk from the last two articles. An upcoming follow-up article will describe methods of improving worst case latency.
Embedded Systems and Application Processors
With the ever-increasing requirements of modern embedded applications, more and more device designs are making the jump from smaller embedded chips to more powerful application-class processors. This isn't very surprising; even the names could have foretold the trend: MCUs, or microcontrollers, are designed to perform control tasks, while microprocessors are better suited to processing data, and modern applications are all about data.
Often the first question that arises when starting or migrating a design to an application processor is about interrupt latency. Interrupt response time is often stated as a simple number, which at the end of the day doesn't mean much. If I wanted to give a sales pitch, I could say that I can get you to the first instruction of your ISR in less than 100 cycles, which on your typical MPU would be less than 200 ns. That's right, nanoseconds. But that doesn't mean much, because what matters for a latency-sensitive application is just about the worst-case scenario. Worst-case interrupt response time is application-dependent, with a mix of variables affecting the results, including memory type, CPU configuration, and application behaviour.
Interrupt Latency on the Cortex-A
Interrupt latency is mostly affected by how much time it takes for the CPU to fetch the necessary instructions and data in order to process the interrupt entry sequence. On a higher-end platform, this is complicated by the more complex memory hierarchy and memory management. For example, in a Cortex-M, an instruction or data fetch can be done in a well-bounded amount of time, while a Cortex-A requires many more steps when fetching from memory.
Getting an instruction or data will first require a TLB lookup, which may in turn require a page table walk, and the page table entries themselves may or may not be cached. And at that point, the access itself still hasn't been done: a memory access must cross the entire memory hierarchy in search of either a cache hit or the underlying memory. Most modern Cortex-A processors are implemented with a four-level memory hierarchy: the L1 and L2 caches are usually named as such, while the central interconnect can be considered L3 and external memory L4.
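To make the page table walk a little more concrete, here is a minimal sketch of what the walk actually fetches on an ARMv7-A core such as the Cortex-A9: a 1 MB short-descriptor section entry. It also shows how a region ends up cacheable or non-cacheable, which is the difference between the cached and uncached configurations measured below. The helper names, access permissions and use of domain 0 are my own choices for illustration, not the exact setup used for the measurements.

```c
/* Minimal sketch of an ARMv7-A short-descriptor 1 MB "section" entry,
 * the kind of descriptor a page table walk fetches on the Cortex-A9.
 * Helper names and permission choices are illustrative only. */
#include <stdint.h>

#define SECTION   (0x2u)            /* bits[1:0] = 0b10: 1 MB section     */
#define AP_RW     (0x3u << 10)      /* AP[1:0] = 0b11: full read/write    */
#define TEX(x)    ((uint32_t)(x) << 12)
#define C_BIT     (1u << 3)
#define B_BIT     (1u << 2)

/* Normal memory, outer and inner write-back, write-allocate (cacheable). */
static inline uint32_t section_cacheable(uint32_t phys_mb)
{
    return (phys_mb << 20) | TEX(1) | C_BIT | B_BIT | AP_RW | SECTION;
}

/* Same region marked outer and inner non-cacheable. */
static inline uint32_t section_noncacheable(uint32_t phys_mb)
{
    return (phys_mb << 20) | TEX(1) | AP_RW | SECTION;
}
```

With TEX remap disabled, TEX = 0b001 with C and B set selects write-back, write-allocate normal memory, while the same TEX value with C and B cleared makes the region non-cacheable.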
Rather than belabouring the point with a long theoretical discussion, I've simply measured interrupt latency under a few configurations with varying cache states. Measurements were done using a software-generated interrupt on a Zynq-7000 running at 667 MHz. The values are quoted in CPU cycles and include the measurement and interrupt generation overhead.
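To give an idea of how such a measurement can be made, here is a hedged sketch rather than the actual benchmark code: it reads the Cortex-A9 PMU cycle counter, triggers a software-generated interrupt (SGI) through the GIC distributor, and reads the counter again as the very first thing in the ISR. The distributor address is the Zynq-7000 one, and PMU enabling, ISR registration and GIC configuration are left out.

```c
/* Sketch of the measurement idea: timestamp, raise an SGI, timestamp
 * again in the ISR.  Setup of the PMU and GIC is omitted. */
#include <stdint.h>

#define GICD_BASE  0xF8F01000u                    /* Zynq-7000 GIC distributor */
#define GICD_SGIR  (*(volatile uint32_t *)(GICD_BASE + 0xF00u))

static volatile uint32_t t_start, t_isr;

static inline uint32_t read_cycle_counter(void)
{
    uint32_t cycles;
    /* PMCCNTR, the PMU cycle counter; it must be enabled beforehand. */
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
}

void sgi_isr(void)               /* registered for SGI 0 with the GIC driver */
{
    t_isr = read_cycle_counter();            /* first thing in the ISR */
}

uint32_t measure_latency_once(void)
{
    t_isr = 0;
    t_start = read_cycle_counter();
    GICD_SGIR = (2u << 24) | 0u;   /* SGI 0, forwarded to the requesting CPU */
    while (t_isr == 0)             /* spin until the ISR has run */
        ;
    return t_isr - t_start;        /* includes the generation overhead */
}
```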
The table contains measurements for four configurations: code and data stored in external DDR memory, and code and data in the on-chip RAM (OCRAM), each tested with the memory marked as cacheable and as non-cacheable. The scenario column indicates which cache, TLB or branch-predictor state was cleaned or invalidated before the measurement; a sketch of the corresponding maintenance operations follows the table.
Configuration | Scenario | Min (cycles) | Max (cycles) | Avg (cycles) |
---|---|---|---|---|
DDR Cached | Normal | 242 | 242 | 242 |
DDR Cached | TLB Invalidate | 326 | 496 | 328 |
DDR Cached | BP Invalidate | 298 | 298 | 298 |
DDR Cached | TLB+BP Invalidate | 384 | 478 | 396 |
DDR Cached | L1I Clean | 400 | 414 | 404 |
DDR Cached | L1D Clean | 396 | 420 | 408 |
DDR Cached | L1 D+I Clean | 532 | 580 | 546 |
DDR Cached | L2 Clean | 242 | 242 | 242 |
DDR Cached | L1+L2 Clean | 1582 | 2978 | 1844 |
DDR Cached | L1+L2+BP+TLB Clean | 1746 | 2896 | 1892 |
OCRAM Cached | Normal | 240 | 240 | 240 |
OCRAM Cached | TLB Invalidate | 242 | 242 | 242 |
OCRAM Cached | BP Invalidate | 298 | 298 | 298 |
OCRAM Cached | TLB+BP Invalidate | 298 | 298 | 298 |
OCRAM Cached | L1I Clean | 402 | 424 | 410 |
OCRAM Cached | L1D Clean | 424 | 462 | 448 |
OCRAM Cached | L1 D+I Clean | 588 | 620 | 600 |
OCRAM Cached | L2 Clean | 240 | 240 | 240 |
OCRAM Cached | L1+L2 Clean | 592 | 620 | 594 |
OCRAM Cached | L1+L2+BP+TLB Clean | 580 | 626 | 596 |
DDR Uncached | Normal | 2076 | 3182 | 2186 |
DDR Uncached | TLB Invalidate | 2100 | 3204 | 2210 |
DDR Uncached | BP Invalidate | 2846 | 4010 | 2984 |
DDR Uncached | TLB+BP Invalidate | 2874 | 4052 | 2996 |
DDR Uncached | L1I Clean | 2818 | 4062 | 2980 |
DDR Uncached | L1I+BP+TLB Clean | 2880 | 3996 | 2986 |
OCRAM Uncached | Normal | 686 | 700 | 690 |
OCRAM Uncached | TLB Invalidate | 692 | 694 | 692 |
OCRAM Uncached | BP Invalidate | 822 | 822 | 822 |
OCRAM Uncached | TLB+BP Invalidate | 822 | 822 | 822 |
OCRAM Uncached | L1I Clean | 822 | 822 | 822 |
OCRAM Uncached | L1I+BP+TLB Clean | 822 | 822 | 822 |
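As a reference for the scenario names above, here is a hedged sketch of the standard ARMv7-A maintenance operations that can produce those states on a Cortex-A9. It is illustrative only; these are not necessarily the exact sequences behind the measurements, and cleaning the L2 on the Zynq-7000 goes through the PL310 L2 cache controller rather than CP15.

```c
/* Illustrative ARMv7-A (Cortex-A9) maintenance operations corresponding
 * to the scenario names in the table; not necessarily the code used. */
#include <stdint.h>

static inline void tlb_invalidate_all(void)
{
    __asm__ volatile("mcr p15, 0, %0, c8, c7, 0" :: "r"(0) : "memory"); /* TLBIALL */
    __asm__ volatile("dsb" ::: "memory");
    __asm__ volatile("isb");
}

static inline void branch_predictor_invalidate_all(void)
{
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 6" :: "r"(0) : "memory"); /* BPIALL */
    __asm__ volatile("dsb" ::: "memory");
    __asm__ volatile("isb");
}

static inline void l1_icache_invalidate_all(void)
{
    __asm__ volatile("mcr p15, 0, %0, c7, c5, 0" :: "r"(0) : "memory"); /* ICIALLU */
    __asm__ volatile("dsb" ::: "memory");
    __asm__ volatile("isb");
}

/* Cleaning the L1 data cache is done line by line with DCCSW (clean by
 * set/way), looping over every set and way of the cache. */
static inline void l1_dcache_clean_line(uint32_t set_way)
{
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 2" :: "r"(set_way) : "memory"); /* DCCSW */
}
```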
Analysis
Let’s look first at the normal configuration, which would be code and data stored in cached external DDR. The best-case interrupt latency is 242 cycles, or about 0.36 µs on a 667 MHz Zynq, which is not bad at all. The worst case among the measured scenarios comes with clean L1 and L2 caches, as well as empty TLBs and branch predictors; in those cases the measurements show a worst case of just under 3000 cycles, or around 4.5 µs. Moving code and data into on-chip memory gives a very similar best case; this makes sense, since this is with a warm L1 cache, so memory accesses never reach the L2. The worst case improves to 626 cycles, or about 0.9 µs. A point of note is that on the Zynq, the OCRAM sits at the same level as the L2 cache, so accesses to the OCRAM from the CPU do not have to cross the central interconnect.
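As a side note, the cycle counts convert to time simply by dividing by the CPU clock. A trivial helper, with the 667 MHz clock of this particular setup hard-coded and the function name being my own:

```c
/* Convert a cycle count to microseconds at the 667 MHz clock used here. */
static inline double cycles_to_us(unsigned int cycles)
{
    return (double)cycles / 667.0;   /* 667 cycles per microsecond */
}
/* e.g. cycles_to_us(242) ~ 0.36 us, cycles_to_us(626) ~ 0.94 us */
```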
There are also a few oddities that might seem strange at first. One of them is the impact of a cold TLB, which doesn’t count for much, or in some cases for anything at all, if we trust the measurements. For example, when running from on-chip RAM, clearing the TLB seems to have no effect. This is mostly because the code and data fit within a single page, so only one fetch from the page table is required. In that case, even if the TLB is flushed before measuring, the page table walk happens before the measured window and isn’t captured in the data. The memory layout of the external-memory configuration is a little different, and it requires two table lookups instead of one.
Parting Words
The measurements, however, do not reflect the absolute worst-case scenario, as we are considering clean caches on a mostly idle SoC. When qualifying a design for worst-case interrupt response time, it is important to measure with the complete application. If possible, I’ll try to simulate a mostly dirty cache with some interconnect contention before the next post, which will be about how to mitigate interrupt latency in a Cortex-A based system.
Click here to read the follow-up article on how to improve the interrupt latency on a Cortex-A9.
Questions or comments?
Do not hesitate to contact us at blog@jblopen.com. Your questions, comments, and suggestions are appreciated.