Frequency of shader invocations in rendering commands

Each shader stage has its own frequency of invocations. I will use the OpenGL terminology, but D3D works the same way (since they’re both modelling the same hardware relationships).

Vertex Shaders

These are the second most complicated. They execute once for every input vertex… kinda. If you are using non-indexed rendering, then the ratio is exactly 1:1. Every input vertex will execute on a separate vertex shader instance.

If you are using indexed rendering, then it gets complicated. It’s more-or-less 1:1, each vertex having its own VS invocation. However, thanks to post-T&L caching (also called post-transform caching), it is possible for a vertex shader to be executed fewer than once per input vertex.

See, a vertex shader’s execution is assumed to create a 1:1 mapping between input vertex data and output vertex data. This means if you pass identical input data to a vertex shader (in the same rendering command), your VS is expected to generate identical output data. So if the hardware can detect that it is about to execute a vertex shader on the same input data that it has used previously, it can skip that execution and simply use the outputs from the previous execution. Assuming it has those values lying around, such as in a cache.

Hardware detects this by using the vertex’s index (which is why it doesn’t work for non-indexed rendering). If the same index is provided to a vertex shader, it is assumed that the shader will get all of the same input values, and therefore will generate the same output values. So the hardware will cache output values based on indices. If an index is in the post-T&L cache, then the hardware will skip the VS’s execution and just use the output values.
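To make the index-based caching concrete, here is a minimal sketch that counts VS invocations for an index stream. The FIFO eviction policy, the default cache size of 16, and the function name `count_vs_invocations` are all assumptions for illustration; real hardware picks its own cache size and replacement policy.

```python
from collections import OrderedDict

def count_vs_invocations(indices, cache_size=16):
    """Count VS runs for an index stream, assuming a FIFO cache keyed on the index.

    cache_size and the FIFO policy are illustrative assumptions, not a
    description of any particular GPU.
    """
    cache = OrderedDict()  # index -> placeholder for the cached VS outputs
    invocations = 0
    for index in indices:
        if index in cache:
            continue  # cache hit: reuse the earlier outputs, no VS run
        invocations += 1  # cache miss: the VS actually executes
        cache[index] = True
        if len(cache) > cache_size:
            cache.popitem(last=False)  # evict the oldest entry
    return invocations

# Two triangles forming a quad: 6 indices, but only 4 distinct vertices.
quad_indices = [0, 1, 2, 2, 1, 3]
print(count_vs_invocations(quad_indices))                # 4
print(count_vs_invocations(quad_indices, cache_size=0))  # 6, same as non-indexed
```

With no cache at all the count falls back to one invocation per index, which is the non-indexed behaviour described above.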

Instancing only slightly complicates post-T&L caching. Rather than caching solely on the vertex index, it caches based on the index and instance ID. So it only uses the cached data if both values are the same.
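A rough sketch of what keying on both values means for the invocation count. It assumes an unbounded cache for simplicity (real caches are small), and `count_vs_invocations_instanced` is a made-up name.

```python
def count_vs_invocations_instanced(indices, instance_count):
    """Count VS runs when cached outputs are keyed on (index, instance ID).

    Assumes an unbounded cache, which is an idealization.
    """
    seen = set()
    invocations = 0
    for instance_id in range(instance_count):
        for index in indices:
            key = (index, instance_id)
            if key not in seen:  # same index in another instance is still a miss
                seen.add(key)
                invocations += 1
    return invocations

# A quad (4 distinct vertices) drawn with 3 instances.
print(count_vs_invocations_instanced([0, 1, 2, 2, 1, 3], 3))  # 12
```

Vertices are reused within an instance, but never across instances, since the instance ID can change the shader's inputs.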

So generally, VS’s execute once for every vertex, but if you optimize your geometry with indexed data, they can execute fewer times. Sometimes far fewer, depending on how you do it.

Tessellation Control Shaders

Or Hull Shaders in D3D parlance.

The TCS is very simple in this regard. It will execute exactly once for each output vertex in each patch of the rendering command (the TCS itself declares the output patch size, which need not match the input patch size). No caching or other optimizations are done here.
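As a quick bit of arithmetic: in GLSL the TCS declares how many vertices its output patch has, and it runs once per output vertex of each patch. The function and parameter names below are just for illustration.

```python
def tcs_invocations(vertex_count, input_patch_size, output_patch_size):
    """Number of TCS invocations for one rendering command.

    vertex_count:      vertices submitted by the draw call
    input_patch_size:  vertices per input patch
    output_patch_size: vertices per output patch, as declared by the TCS
    """
    patch_count = vertex_count // input_patch_size  # leftover vertices form no patch
    return patch_count * output_patch_size

# 48 vertices drawn as 16-vertex patches, with the TCS emitting 16-vertex patches.
print(tcs_invocations(48, 16, 16))  # 48
# Same draw, but the TCS emits 4-vertex patches.
print(tcs_invocations(48, 16, 4))   # 12
```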

Tessellation Evaluation Shaders

Or Domain Shaders in D3D parlance.

The TES executes after the tessellation primitive generator has generated new vertices. Because of that, how frequently it executes will obviously depend on your tessellation parameters.

The TES takes vertices generated by the tessellator and outputs vertices. It does so in a 1:1 ratio.

But similar to Vertex Shaders, it is not necessarily 1:1 for each vertex in each of the output primitives. Like a VS, the TES is assumed to provide a direct 1:1 mapping between locations in the tessellated primitives and output parameters. So if you invoke a TES multiple times with the same patch location, it is expected to output the same value.

As such, if generated primitives share vertices, the TES will often only be invoked once for such shared vertices. Unlike vertex shaders, you have no control over how much the hardware will utilize this. The best you can do is hope that the generation algorithm is smart enough to minimize how often it calls the TES.
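To get a feel for how much sharing there can be, here is a sketch that compares the best and worst case for an idealized n-by-n grid of quads, each split into two triangles. This is an assumption for illustration only; the real primitive generator's pattern depends on the tessellation levels and spacing mode.

```python
def grid_tes_bounds(n):
    """Best/worst-case TES invocation counts for an idealized n x n grid.

    Returns (unique_vertices, triangle_corners):
      unique_vertices  -> every shared vertex is evaluated exactly once
      triangle_corners -> no sharing at all, one evaluation per triangle corner
    """
    unique_vertices = (n + 1) ** 2
    triangle_count = 2 * n * n
    triangle_corners = 3 * triangle_count
    return unique_vertices, triangle_corners

best, worst = grid_tes_bounds(4)
print(best, worst)  # 25 96
```

Where the actual count lands between those two numbers is up to the hardware.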

Geometry Shaders

A Geometry Shader will be invoked once for each point, line or triangle primitive, either directly given by the rendering command or generated by the tessellator. So if you render 6 vertices as unconnected lines, your GS will be invoked exactly 3 times.

Each GS invocation can generate zero or more primitives as output.

The GS can use instancing internally (in OpenGL 4.0 or Direct3D 11). This means that, for each primitive that reaches the GS, the GS will be invoked X times, where X is the number of GS instances. Each such invocation will get the same input primitive data (with a special input value used to distinguish between such instances). This is useful for more efficiently directing primitives to different layers of layered framebuffers.
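Putting the two rules together (one invocation per input primitive, times the number of GS instances), a small sketch looks like this. The mode strings and function names are made up for the example, and it ignores tessellation and primitive restart.

```python
def primitive_count(mode, vertex_count):
    """Number of primitives assembled from vertex_count vertices."""
    if mode == "points":
        return vertex_count
    if mode == "lines":
        return vertex_count // 2
    if mode == "line_strip":
        return max(vertex_count - 1, 0)
    if mode == "triangles":
        return vertex_count // 3
    if mode == "triangle_strip":
        return max(vertex_count - 2, 0)
    raise ValueError(f"unknown mode: {mode}")

def gs_invocations(mode, vertex_count, gs_instances=1):
    """GS invocations: one per input primitive per GS instance."""
    return primitive_count(mode, vertex_count) * gs_instances

print(gs_invocations("lines", 6))                  # 3
print(gs_invocations("lines", 6, gs_instances=4))  # 12
```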

Fragment Shaders

Or Pixel Shaders in D3D parlance. Even though they aren’t pixels yet, may not become pixels, and they can be executed multiple times for the same pixel 😉

These are the most complicated with regard to invocation frequency. How often they execute depends on a lot of things.

FS’s must be executed at least once for each pixel-sized area that a primitive rasterizes to. But they may be executed more than that.

In order to compute derivatives for texture functions, one FS invocation will often borrow values from one of its neighboring invocations. This is problematic if there is no such invocation, which happens when a neighbor falls outside the boundary of the primitive being rasterized.

In such cases, there will still be a neighboring FS invocation. Even though it produces no actual data, it still exists and still does work. The good part is that these helper invocations don’t hurt performance. They’re basically using up shader resources that would have otherwise gone unused. Also, any attempt by such helper invocations to actually output data will be ignored by the system.

But they do still technically exist.
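Here is a rough sketch of how you could count them. It assumes the hardware shades fragments in aligned 2x2 blocks, which is a common arrangement but an assumption here, not something the APIs promise; the function name is made up.

```python
def count_fs_invocations(covered_pixels):
    """Count real and helper FS invocations for one primitive.

    covered_pixels: set of (x, y) pixel coordinates the primitive covers.
    Assumes shading happens in aligned 2x2 quads (implementation-specific).
    Returns (real_invocations, helper_invocations).
    """
    quads = {(x // 2, y // 2) for x, y in covered_pixels}
    total = 4 * len(quads)  # every touched quad runs all four invocations
    real = len(covered_pixels)
    return real, total - real

# A small triangle-ish footprint covering 6 pixels across 3 quads.
covered = {(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (0, 2)}
print(count_fs_invocations(covered))  # (6, 6)
```

Small or thin primitives have the worst ratio, since nearly every quad they touch is only partly covered.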

A less transparent issue revolves around multisampling. See, multisampling implementations (particularly in OpenGL) are allowed to decide on their own how many FS invocations to issue. While there are ways to force multisampled rendering to create an FS invocation for every sample, there is no guarantee that implementations will execute the FS only once per covered pixel outside of these cases.

For example, if I recall correctly, if you create a multisample image with a high sample count on certain NVIDIA hardware (8 to 16 or something like that), then the hardware may decide to execute the FS multiple times. Not necessarily once per sample, but once for every 4 samples or so.

So how many FS invocations do you get? At least one for every pixel-sized area covered by the primitive being rasterized. Possibly more if you’re doing multisampled rendering.

Compute Shaders

The exact number of invocations that you specify. That is, the number of work groups you dispatch multiplied by the number of invocations per group specified by your CS (your local group size). No more, no less.
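As plain arithmetic, with both the dispatch and the local size given as three-dimensional counts (the function name is just for illustration):

```python
def cs_invocations(num_groups, local_size):
    """Total CS invocations for one dispatch.

    num_groups: (x, y, z) work group counts passed to the dispatch call
    local_size: (x, y, z) invocations per work group declared by the CS
    """
    gx, gy, gz = num_groups
    lx, ly, lz = local_size
    return (gx * gy * gz) * (lx * ly * lz)

# 8x8 work groups of 16x16 invocations each.
print(cs_invocations((8, 8, 1), (16, 16, 1)))  # 16384
```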