When an interrupt occurs, what happens to instructions in the pipeline?

First, terminology:

Usually, at Intel at least, an interrupt is something that comes from the outside world. Usually it is not synchronized with instructions executing on the processor, i.e. it is an asynchronous external interrupt.

In Intel terminology an exception is something caused by instructions executing on the processor. E.g. a page fault, or an undefined instruction trap.

—+ Interrupts flush all instructions in flight

On every machine that I am familiar with – e.g. all Intel processors since the P5 (I worked on the P6), AMD x86s, ARM, MIPS – when the interrupt signal is received the instructions in the pipeline are nearly always flushed, thrown away.

The only reason I say “nearly always” is that on some of these machines you are not always at a place where you are allowed to receive an interrupt. So, you proceed to the next place where an interrupt is allowed – any instruction boundary, typically – and THEN throw away all of the instructions in the pipeline.

For that matter, interrupts may be blocked. So you proceed until interrupts are unblocked, and THEN you throw away all of the instructions in the pipeline.

Now, these machines aren’t exactly simple 5 stage pipelines. Nevertheless, this observation – that most machines throw away all instructions in the pipeline, in pipestages before the pipestage where the interrupt logic lives – remains almost universally true.

In simple machines the interrupt logic typically lives in the last stage of the pipeline, WB, corresponding roughly to the commit pipestage of advanced machines. Sometimes it is moved up to a pipestage just before, e.g. MEM in your example. So, on such machines, all instructions in IF, ID, and EX, and usually MEM, are thrown away.
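As a toy sketch (a hypothetical 5-stage model, not any real machine's control logic), the flush behavior described above looks like this: everything in stages at or before the stage where the interrupt is recognized is discarded, and only instructions already past it commit.

```python
# Toy 5-stage pipeline: on an interrupt recognized at the MEM stage,
# instructions in earlier stages (IF, ID, EX) plus MEM itself are
# discarded; only the instruction already in WB commits.
# Hypothetical illustration, not any real machine's logic.
PIPE_STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def flush_on_interrupt(pipeline, interrupt_stage="MEM"):
    """pipeline: dict stage -> instruction tag (or None).
    Returns (instructions that survive and commit, instructions flushed)."""
    idx = PIPE_STAGES.index(interrupt_stage)
    flushed = [pipeline[s] for s in PIPE_STAGES[: idx + 1] if pipeline[s]]
    survivors = [pipeline[s] for s in PIPE_STAGES[idx + 1 :] if pipeline[s]]
    return survivors, flushed

pipe = {"IF": "i4", "ID": "i3", "EX": "i2", "MEM": "i1", "WB": "i0"}
survivors, flushed = flush_on_interrupt(pipe)
# survivors == ["i0"]; flushed == ["i4", "i3", "i2", "i1"]
```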

—++ Why I care: Avoiding Wasted Work

This topic is near and dear to my heart because I have proposed NOT doing this. E.g. in customer visits while we were planning to build the P6, I asked customers which they preferred – lower latency interrupts, flushing instructions that are in flight, or (slightly) higher throughput, allowing at least some of the instructions in flight to complete, at the cost of slightly longer latency.

However, although some customers preferred the latter, we chose to do the traditional thing, flushing immediately. Apart from the lower latency, the main reason is complexity:

E.g. if you take an interrupt, but one of the instructions already in flight also takes an exception, after you have resteered IF (instruction fetch) but before any instruction of the interrupt handler has committed, which takes priority? A: it depends. And that sort of thing is a pain to deal with.

—+++ Folklore: Mainframe OS Interrupt Batching

This is rather like the way that some IBM mainframe OSes are reported to have operated:

  • with all interrupts blocked in normal operation except for the timer interrupt;
  • in the timer interrupt, you unblock interrupts, and handle them all;
  • and then return to normal operation with interrupts blocked.

Conceivably they might only use such an “interrupt batching” mode when heavily loaded; if lightly loaded, they might not block interrupts.

—+++ Deferred Machine Check Exceptions

The idea of deferring interrupts to give instructions already in the pipeline a chance to execute is also similar to what I call the Deferred Machine Check Exception – a concept that I included in the original Intel P6 family Machine Check Architecture, circa 1991-1996, but which appears not to have been released.

Here’s the rub: machine check errors like (un)correctable ECC errors can occur AFTER an instruction has retired (i.e. after supposedly younger instructions have committed state, e.g. written registers), or BEFORE the instruction has retired.

The classic example of AFTER errors is an uncorrectable ECC triggered by a store that is placed into a write buffer at graduation. Pretty much all modern machines do this (certainly all machines with TSO store buffering), which means that there is always the possibility of an imprecise machine check error that could have been precise if you cared enough not to buffer stores.

The classic example of BEFORE errors is … well, every instruction, on any machine with a pipeline. But more interestingly, errors on wrong-path instructions, in the shadow of a branch misprediction.

When a load instruction gets an uncorrectable ECC error, you have two choices:

(1) you could pull the chain immediately, killing not just instructions YOUNGER than the load instruction but also any OLDER instructions

(2) or you could write some sort of status code into the logic that controls speculation, and take the exception at retirement. This is pretty much what you have to do for a page fault, and it makes such errors precise, helping debugging.

(3) But what if the load instruction that got the uncorrectable ECC error was a wrong path instruction, and never retires because an older inflight branch mispredicted and went another way?

Well, you could write the status to try to make it precise. You should have counters of precise errors and imprecise errors. You could otherwise ignore an error on such a wrong-path instruction – after all, if it is a hard error, it will either be touched again, or it might not be. E.g. it is possible that the error would be architecturally silent – e.g. a bad cache line might be overwritten by a good cache line for the same address.

And, if you really wanted, you could set a bit so that if an older branch mispredicts, then you take the machine check exception at that point in time.

Such an error would not occur at a program counter associated with the instruction that caused the error, but might still have otherwise precise state.

I call (2) deferring a machine check exception; (3) is just how you might handle the deferral.
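A minimal sketch of this deferral idea (all names hypothetical, not the actual P6 Machine Check Architecture): the error is recorded as status on the instruction's ROB entry rather than signalled immediately; if the instruction retires, the machine check is taken precisely at retirement; if it is squashed as wrong-path work, the error is only counted.

```python
# Hypothetical sketch of deferring a machine check, per choices (2)/(3):
# an uncorrectable ECC error on a load marks its ROB entry rather than
# pulling the chain immediately. Retirement takes the exception
# precisely; a wrong-path squash merely counts the error.
class RobEntry:
    def __init__(self, tag):
        self.tag = tag
        self.mce_pending = False  # deferred machine check status

stats = {"precise_mce": 0, "ignored_wrong_path_mce": 0}

def record_ecc_error(entry):
    entry.mce_pending = True      # defer: just mark the ROB entry

def retire(entry):
    if entry.mce_pending:
        stats["precise_mce"] += 1
        raise RuntimeError(f"machine check at retirement of {entry.tag}")

def squash(entry):                # an older branch mispredicted
    if entry.mce_pending:
        stats["ignored_wrong_path_mce"] += 1
```

The counters reflect the point above: you should be able to tell how many errors were delivered precisely versus silently dropped on wrong paths.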

IIRC, all Intel P6 machine check exceptions were imprecise.

—++ On the gripping hand: even faster

So, we have discussed

0) taking the interrupt immediately, or, if interrupts are blocked, executing instructions and microinstructions until an interrupt unblocked point is reached. And then flushing all instructions in flight.

1) trying to execute instructions in the pipeline, so as to avoid wasted work.

But there is a third possibility:

-1) if you have microarchitecture state checkpoints, take the interrupt immediately, never waiting for an interrupt unblocked point. You can only do this if you have a checkpoint of all relevant state at the most recent “safe to take an interrupt” point.

This is even faster than 0), which is why I labelled it -1). But it requires checkpoints, which many but not all aggressive CPUs use – e.g. Intel P6 did not use checkpoints. And such post-retirement checkpoints get funky in the presence of shared memory – after all, you can do memory operations like loads and stores while interrupts are blocked. And you can even communicate between CPUs. Even hardware transactional memory usually doesn’t do that.
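A minimal sketch of option -1), under stated assumptions (all names hypothetical; this deliberately ignores the shared-memory problem just noted, since stores done after the checkpoint may already be visible to other CPUs):

```python
# Hypothetical sketch of option -1): snapshot architectural state at the
# most recent point where interrupts were unblocked; if an interrupt
# arrives while they are blocked, roll back to that checkpoint and take
# the interrupt immediately, instead of executing forward to the next
# unblocked point.
import copy

class Core:
    def __init__(self):
        self.regs = {"r1": 0}
        self.interrupts_blocked = False
        self.checkpoint = None

    def block_interrupts(self):
        # last "safe to take an interrupt" point: checkpoint it
        self.checkpoint = copy.deepcopy(self.regs)
        self.interrupts_blocked = True

    def unblock_interrupts(self):
        self.interrupts_blocked = False
        self.checkpoint = None

    def take_interrupt(self):
        if self.interrupts_blocked:
            # roll back to the safe point and deliver immediately
            self.regs = self.checkpoint
            self.interrupts_blocked = False
        return self.regs
```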

—+ Exceptions mark the instructions affected

Conversely, exceptions, things like page faults, mark the instruction affected.

When that instruction is about to commit, at that point all later instructions after the exception are flushed, and instruction fetch is redirected.

Conceivably, instruction fetch could be resteered earlier, the way branch mispredictions are already handled on most processors, at the point at which we know that the exception is going to occur. I don’t know anyone who does this. On current workloads, exceptions are not that important.
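The contrast with interrupts can be sketched as follows (a hypothetical model): the exception is only marked on the faulting instruction, and nothing is flushed until that instruction reaches commit, at which point it and all younger instructions are thrown away while older ones commit normally.

```python
# Sketch of exception handling at commit: the faulting instruction is
# marked; when it would commit, older instructions have already
# committed, and the marked instruction plus everything younger is
# flushed before fetch is redirected to the handler. Hypothetical model.
def commit_with_exception(rob, faulting_tag):
    """rob: list of instruction tags, oldest first.
    Returns (instructions that commit, instructions flushed)."""
    idx = rob.index(faulting_tag)
    committed = rob[:idx]     # older instructions commit normally
    flushed = rob[idx:]       # faulting instruction + younger ones
    return committed, flushed

committed, flushed = commit_with_exception(["i0", "i1", "i2", "i3"], "i1")
# committed == ["i0"]; flushed == ["i1", "i2", "i3"]
```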

—+ “Software Interrupts”

A “software interrupt” is a misnamed instruction usually associated with system calls.

Conceivably, such an instruction could be handled without interrupting the pipeline, predicted like a branch.

However, all of the machines I am familiar with serialize in some way. In my parlance, they do not rename the privilege level.

—+ “Precise Interrupts”, EMON, PEBS

Another poster mentioned precise interrupts.

This is a historical term. On most modern machines interrupts are defined to be precise. Older machines with imprecise interrupts have not been very successful in the marketplace.

However, there is an alternate meaning, one I was involved in introducing: when I got Intel to add the capability to produce an interrupt on performance counter overflow, first using external hardware and then inside the CPU, it was, in the first few generations, completely imprecise.

E.g. you might set the counter to count the number of instructions retired. The retirement logic (RL) would see the instructions retire, and signal the performance event monitoring circuitry (EMON). It might take two or three clock cycles to send this signal from RL to EMON. EMON would increment the counter, and then see that there was an overflow. The overflow would trigger an interrupt request to the APIC (Advanced Programmable Interrupt Controller). The APIC might take a few cycles to figure out what was happening, and then signal the retirement logic.

I.e. the EMON interrupt would be signalled imprecisely. Not at the time of the event, but some time thereafter.
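The delay path above can be modeled with a toy simulation (the cycle counts are made up for illustration, not real P6 latencies): by the time the interrupt is delivered, retirement has moved on, so the interrupt is attributed to a later program counter – the “skid”.

```python
# Toy model of the imprecise EMON path: counter overflow is detected,
# and the interrupt delivered, several cycles after the causing
# instruction retired, so a later PC gets the blame. Delays are
# illustrative guesses, not real hardware numbers.
RL_TO_EMON_DELAY = 3   # cycles: retirement logic -> EMON signal
APIC_DELAY = 4         # cycles: EMON -> APIC -> back to retirement

def emon_skid(retired, overflow_at):
    """retired: list of (cycle, pc) retirement events, oldest first.
    overflow_at: index of the retirement that overflows the counter.
    Returns (pc that caused the overflow, pc blamed by the interrupt)."""
    causing_cycle, causing_pc = retired[overflow_at]
    delivery_cycle = causing_cycle + RL_TO_EMON_DELAY + APIC_DELAY
    blamed_pc = next(pc for c, pc in retired if c >= delivery_cycle)
    return causing_pc, blamed_pc

# one 4-byte instruction retiring per cycle, starting at PC 0x1000
trace = [(c, 0x1000 + 4 * c) for c in range(20)]
cause, blamed = emon_skid(trace, overflow_at=5)
# cause == 0x1014; blamed == 0x1030, i.e. 7 instructions of skid
```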

Why this imprecision? Well, in 1992-6, performance measurement hardware was not a high priority. We were leveraging existing interrupt hardware. Beggars can’t be choosers.

But furthermore, some performance events are intrinsically imprecise. E.g. when do you signal an interrupt for a cache miss on a speculative instruction that never retires? (I have a scheme I called Deferred EMON events, but this is still considered too expensive.) For that matter, what about cache misses on store instructions, where the store is placed into a store buffer, and the instruction has already retired?

I.e. sometimes performance events occur after the instruction they are associated with has committed (retired). Sometimes before. And often not exactly at the instruction they are associated with.

But in all of the implementations so far, as far as I know, these performance events are treated like interrupts: existing instructions in the pipe are flushed.

Now, you can make a performance event precise by treating it like a trap. E.g. if it is an event like instructions retired, you can have the retirement logic trap immediately, instead of taking that circuitous loop I described above. If it occurs earlier in the pipeline, you can have the fact that it occurred marked in the instruction fault status in the ROB (Re-Order Buffer). Something like this is what Intel has done with PEBS (Precise Event Based Sampling). http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf.

However, note that not all events can be sampled using PEBS. For example, PEBS in the example above can count loads that took a cache hit or miss, but not stores (since stores occur later).

So this is like exceptions: the event is delivered only when the instruction retires. Because in a sense the event has not completely occurred – it is a load instruction, that takes a cache miss, and then retires. And instructions after the marked PEBS instruction are flushed from the pipeline.

I hope this helps.