Interrupting an assembly instruction while it is operating

Yes, all “normal” ISAs, including 8080 and x86, guarantee that instructions are atomic with respect to interrupts on the same core. Either an instruction has fully executed and all its architectural effects are visible (to the interrupt handler), or none of them are. Any deviations from this rule are generally carefully documented.


For example, Intel’s x86 manual vol.3 (~1000 page PDF) does make a point of specifically saying this:

6.6 PROGRAM OR TASK RESTART
To allow the restarting of program or task following the handling of an exception or an interrupt, all exceptions
(except aborts) are guaranteed to report exceptions on an instruction boundary. All interrupts are guaranteed to be
taken on an instruction boundary.

An old paragraph in Intel’s vol.1 manual talks about single-core systems using cmpxchg without a lock prefix to read-modify-write atomically (with respect to other software, not hardware DMA access).

The CMPXCHG instruction is commonly used for testing and modifying semaphores. It checks to see if a semaphore
is free. If the semaphore is free, it is marked allocated; otherwise it gets the ID of the current owner. This is all done
in one uninterruptible operation
[because it’s a single instruction]. In a single-processor system, the CMPXCHG instruction eliminates the need to
switch to protection level 0 (to disable interrupts) before executing multiple instructions to test and modify a semaphore.

For multiple processor systems, CMPXCHG can be combined with the LOCK prefix to perform the compare and
exchange operation atomically. (See “Locked Atomic Operations” in Chapter 8, “Multiple-Processor Management,”
of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for more information on atomic
operations.)

(For more about the lock prefix and how it’s implemented vs. non-locked add [mem], 1, see Can num++ be atomic for ‘int num’?)
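
For example, here’s a minimal sketch of the kind of semaphore-taking sequence that paragraph describes (NASM syntax; sem and my_id are hypothetical labels, not from the manual). On a single-core system the missing lock prefix is fine because the single cmpxchg instruction can’t be split by an interrupt:

take_sem:
    xor     eax, eax              ; expected value: 0 = semaphore is free
    mov     ecx, [my_id]          ; value to store if we win: our owner ID
    cmpxchg [sem], ecx            ; if [sem]==EAX: store ECX, set ZF.  Else: load [sem] into EAX, clear ZF
    jnz     .already_owned        ; ZF clear -> semaphore was taken; EAX = current owner's ID
    ret                           ; ZF set -> we now own the semaphore
.already_owned:
    ret                           ; caller can look at EAX to see who owns it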

As Intel points out in that first paragraph, one way to achieve multi-instruction atomicity is to disable interrupts, then re-enable them when you’re done. This is better than using a mutex to protect a larger integer, especially for data shared between the main program and an interrupt handler: if an interrupt fires while the main program holds the lock, the interrupt handler can’t just wait for the lock to be released; that would never happen, because the main program can’t resume until the handler returns.

Disabling interrupts is usually pretty cheap on simple in-order pipelines, especially on microcontrollers. (Sometimes you need to save the previous interrupt state instead of unconditionally re-enabling interrupts, e.g. in a function that might be called with interrupts already disabled.)

Anyway, disabling interrupts is how you could atomically do something with a 64-bit integer on 8080.
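
(8080 asm would wrap the multi-byte access in DI / EI. Here’s the same idea sketched for 32-bit x86 instead, since the rest of this answer is x86. counter64 is a hypothetical shared variable, and this assumes kernel mode or sufficient IOPL so cli is allowed:)

read_counter64:                   ; returns the shared 64-bit value in EDX:EAX
    pushfd                        ; save EFLAGS, including the interrupt flag (IF)
    cli                           ; disable interrupts
    mov     eax, [counter64]      ; low half
    mov     edx, [counter64+4]    ; high half; no interrupt can land between the two loads
    popfd                         ; restore the previous interrupt state, not blindly STI
    ret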


A few long-running instructions are interruptible, according to rules documented for each such instruction.

e.g. x86’s rep-string instructions, like rep movsb (a single-instruction memcpy of arbitrary size), are architecturally equivalent to repeating the base instruction (movsb) RCX times, decrementing RCX each time and incrementing or decrementing the pointer inputs (RSI and RDI). An interrupt arriving during the copy can set RCX to starting_value - bytes_copied and (if RCX is then non-zero) leave RIP pointing at the instruction, so on resuming after the interrupt the rep movsb will run again and do the rest of the copy.
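
(As a sketch of what that equivalence means, rep movsb with DF=0 behaves as if it were the loop below; the register state at the commented point is what an interrupt handler would see. This is just an illustration of the documented behaviour, not how hardware implements it, and real movsb doesn’t actually clobber AL.)

copy_loop:
    test    rcx, rcx              ; RCX = remaining byte count
    jz      copy_done
    mov     al, [rsi]             ; copy one byte (AL used here only for illustration)
    mov     [rdi], al
    inc     rsi
    inc     rdi
    dec     rcx
    ; an interrupt taken here sees the updated RCX/RSI/RDI, with RIP still pointing at the rep movsb
    jmp     copy_loop
copy_done: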

Other x86 examples include SIMD gather loads (AVX2/AVX512) and scatter stores (AVX512). E.g. vpgatherdd ymm0, [rdi + ymm1*4], ymm2 does up to 8 32-bit loads, depending on which elements of the mask register ymm2 are set, and merges the results into ymm0.

In the normal case (no interrupts, no page faults or other synchronous exceptions during the gather), you get the data in the destination register, and the mask register ends up zeroed. The mask register thus gives the CPU somewhere to record progress: each element’s mask bit is cleared as that element completes, so re-running the instruction after an interrupt or fault only has to redo the remaining elements.

Gather and scatter are slow, and might need to trigger multiple page faults, so for synchronous exceptions this guarantees forward progress even under pathological conditions where handling a page fault unmaps all other pages. But more relevantly, it means avoiding redoing TLB misses if a middle element page faults, and not discarding work if an async interrupt arrives.
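
(A minimal sketch of the usual usage pattern, NASM syntax with AVX2 assumed and a hypothetical indices buffer plus base pointer in RDI: start with an all-ones mask and let the instruction clear it as it goes.)

    vpcmpeqd   ymm2, ymm2, ymm2             ; mask = all-ones: request all 8 elements
    vpxor      ymm0, ymm0, ymm0             ; destination; gathered elements are merged in per lane
    vmovdqu    ymm1, [indices]              ; 8 signed dword indices
    vpgatherdd ymm0, [rdi + ymm1*4], ymm2   ; ymm2 = 0 on normal completion; after a fault or
                                            ; interrupt partway through, only the not-yet-done
                                            ; elements still have their mask bits set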


Some other long-running instructions (like wbinvd, which writes back and invalidates all data caches across all cores) are not architecturally interruptible, or even microarchitecturally abortable (to discard partial work and go handle an interrupt). It’s privileged, so user-space can’t execute it as a denial-of-service attack that causes high interrupt latency.


A related example of documented funny behaviour is when x86 popad goes off the top of the stack (past the segment limit). This is for an exception (not an external interrupt), documented earlier in the vol.3 manual, in section 6.5 EXCEPTION CLASSIFICATIONS (i.e. fault / trap / abort; see the PDF for more details).

NOTE
One exception subset normally reported as a fault is not restartable. Such exceptions result in loss of some processor state. For example, executing a POPAD instruction where the stack frame crosses over the end of the stack segment causes a fault to be reported. In this situation, the exception handler sees that the instruction pointer (CS:EIP) has been restored as if the POPAD instruction had not been executed. However, internal processor state (the general-purpose registers) will have been modified. Such cases are considered programming errors. An application causing this class of exceptions should be terminated by the operating system.

Note that this loss of state happens only if popad itself causes an exception, not for any other reason. An external interrupt can’t split popad the way it can split rep movsb or vpgatherdd.

(I guess that for the purposes of faulting, popad effectively works iteratively, popping one register at a time and logically modifying RSP/ESP/SP as well as the target register, instead of checking the whole region it’s going to load against the segment limit before starting; that up-front check would have required an extra add, I guess.)
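
(In those terms, popad would behave roughly like the explicit pop sequence below: a segment-limit fault partway through leaves the earlier registers already overwritten even though EIP is rolled back. This is only a sketch of the guess above, not something the manual spells out.)

    pop     edi
    pop     esi
    pop     ebp
    add     esp, 4                ; popad discards the saved ESP value
    pop     ebx
    pop     edx
    pop     ecx
    pop     eax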


Out-of-order CPUs roll back to the retirement state on interrupts.

CPUs like modern x86, with out-of-order execution and splitting of complex instructions into multiple uops, still ensure this is the case. When an interrupt arrives, the CPU has to pick a point between two of the instructions it’s in the middle of running as the place where the interrupt architecturally happens. It has to discard any work already done on decoding or executing later instructions; assuming the interrupt handler returns, they’ll be re-fetched and start executing over again from scratch.

See When an interrupt occurs, what happens to instructions in the pipeline?.

As Andy Glew says, current CPUs don’t rename the privilege level, so what logically happens (interrupt/exception handler executes after earlier instructions finish) matches what actually happens.

Fun fact, though: x86 interrupts aren’t fully serializing, at least not guaranteed on paper. (In x86 terminology, instructions like cpuid and iret are defined as serializing: they drain the out-of-order back-end and the store buffer, and anything else that might possibly matter. That’s a very strong barrier, and lots of other things aren’t, e.g. mfence.)

In practice (because CPUs don’t rename the privilege level), there won’t be any old user-space instructions/uops still in flight in the out-of-order back-end when an interrupt handler runs.

Async (external) interrupts may also drain the store buffer, depending on how we interpret the wording of Intel’s SDM vol.3 section 11.10: “the contents of the store buffer are always drained to memory in the following situations: … When an exception or interrupt is generated.” Clearly that applies to exceptions (where the CPU core itself generates the interrupt), and might also mean before servicing an interrupt.

(Store data from retired store instructions is not speculative; it definitely will happen, and the CPU has already dropped the state it would need to roll back to before that store instruction. So a large store buffer full of scattered cache-miss stores can hurt interrupt latency: either from waiting for it to drain before any interrupt-handler instructions can run at all, or, if it turns out the store buffer isn’t drained on interrupt entry, at least before any in/out or locked instruction in the ISR can execute.)

Related: Sandpile (https://www.sandpile.org/x86/coherent.htm) has a table of things that are serializing. Interrupts and exceptions aren’t. But again, this doesn’t mean they don’t drain the store buffer. This would be testable with an experiment: look for StoreLoad reordering between a store in user-space and a load (of a different shared variable) in an ISR, as observed by another core.

Part of this section doesn’t really belong in this answer and should be moved somewhere else. It’s here because discussion in comments on What happens to expected memory semantics (such as read after write) when a thread is scheduled on a different CPU core? cited this as a source for the probably wrong claim that interrupts don’t drain the store buffer, which I wrote after misinterpreting “not serializing”.
