Compute Shader

Compute shaders are shader programs that a rendering API provides outside of its normal graphics pipeline, so they can perform arbitrary computations. This lets a programmer use the parallel nature of graphics cards for general-purpose computing.

They are a form of compute kernel and enable general-purpose computing on graphics processing units (GPGPU). In that respect, they serve a similar purpose to OpenCL or CUDA.

OpenGL Compute Shaders

Compute shaders were introduced in OpenGL 4.3.

Example of a compute shader in GLSL:

#version 430 core

/* invocations per work group */
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
void main()
{
    /* compute shader base inputs */
    uvec3 num_work_groups              = gl_NumWorkGroups;       // work group counts passed to glDispatchCompute
    uvec3 current_work_group_id        = gl_WorkGroupID;
    uvec3 current_local_invocation_id  = gl_LocalInvocationID;

    /* derived inputs */
    uvec3 current_global_invocation_id = gl_GlobalInvocationID;   // == gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
    uint  current_local_invocation_idx = gl_LocalInvocationIndex; // 3d gl_LocalInvocationID flattened to 1d index

    /* for workgroup sizes of 1, gl_WorkGroupID == gl_GlobalInvocationID */
}
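
The relationship between the base and derived inputs can be checked on the CPU. The following C sketch is only an illustration of the ID arithmetic: the `uvec3` struct and both function names are hypothetical helpers, not part of OpenGL.

```c
#include <assert.h>

/* Hypothetical stand-in for GLSL's uvec3. */
typedef struct { unsigned x, y, z; } uvec3;

/* Mirrors gl_GlobalInvocationID ==
 *   gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID */
uvec3 global_invocation_id(uvec3 group_id, uvec3 group_size, uvec3 local_id)
{
    /* A local ID always lies inside its work group. */
    assert(local_id.x < group_size.x);
    assert(local_id.y < group_size.y);
    assert(local_id.z < group_size.z);

    uvec3 id = {
        group_id.x * group_size.x + local_id.x,
        group_id.y * group_size.y + local_id.y,
        group_id.z * group_size.z + local_id.z
    };
    return id;
}

/* Mirrors gl_LocalInvocationIndex: the 3d local ID flattened to a 1d
 * index, with x varying fastest. */
unsigned local_invocation_index(uvec3 group_size, uvec3 local_id)
{
    return local_id.z * group_size.x * group_size.y
         + local_id.y * group_size.x
         + local_id.x;
}
```

For a work group size of (8, 4, 1), the invocation with local ID (3, 2, 0) in work group (2, 0, 1) has the global ID (19, 2, 1) and the local index 19.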

Usage code of a compute shader program:

glUseProgram(compute_shader);
glDispatchCompute(num_groups_x, num_groups_y, num_groups_z); // number of work groups to be launched for every dimension
glMemoryBarrier(GL_ALL_BARRIER_BITS); // make the shader's writes visible to subsequent GL commands
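
The group counts passed to glDispatchCompute usually have to be derived from the amount of data and the local size declared in the shader. A minimal sketch of that calculation for one dimension, assuming a hypothetical helper named `num_work_groups`:

```c
#include <assert.h>

/* Number of work groups needed to cover item_count items when every
 * work group processes local_size items. Rounds up, so the last group
 * may be only partially filled. */
unsigned num_work_groups(unsigned item_count, unsigned local_size)
{
    assert(local_size > 0);
    /* Written as divide-plus-remainder to avoid overflow in the sum
     * item_count + local_size - 1. */
    return item_count / local_size + (item_count % local_size != 0);
}
```

For 1000 items and a local size of 64 this yields 16 work groups, i.e. 1024 invocations. The 24 surplus invocations have to be rejected in the shader, for example by comparing gl_GlobalInvocationID.x against the item count.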

Warp/Wavefront (Work Group Sizes)

GPUs execute threads in fixed-size groups called warps (NVIDIA) or wavefronts (AMD), and each GPU is optimized for a certain work group size. The work group size should be a multiple of the number of threads in a warp, which is commonly 32 or 64.

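Common warp sizes:

NVIDIA: 32
AMD: 64 (GCN; RDNA GPUs also support 32)
Intel: 32 (can vary, since the compiler may choose a narrower SIMD width)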
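
To see why the work group size should be a multiple of the warp size, it helps to count the lanes that stay idle otherwise. The sketch below assumes a simplified model in which a work group always occupies whole warps. Both function names are hypothetical.

```c
#include <assert.h>

/* Smallest multiple of warp_size that is >= invocations. */
unsigned round_up_to_warp(unsigned invocations, unsigned warp_size)
{
    assert(warp_size > 0);
    unsigned warps = invocations / warp_size + (invocations % warp_size != 0);
    return warps * warp_size;
}

/* Lanes that are occupied by a work group but do no useful work. */
unsigned idle_lanes(unsigned invocations, unsigned warp_size)
{
    return round_up_to_warp(invocations, warp_size) - invocations;
}
```

Under this model, a work group of 48 invocations on hardware with a warp size of 32 occupies two warps and leaves 16 lanes idle, while a work group of 64 invocations leaves none idle.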

Querying the work group limits at runtime (these are the limits on work group count and size, not the warp size itself):

int work_grp_cnt[3];
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 0, &work_grp_cnt[0]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 1, &work_grp_cnt[1]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 2, &work_grp_cnt[2]);
printf("Max work groups per compute shader\n");
printf("   x: %i\n", work_grp_cnt[0]);
printf("   y: %i\n", work_grp_cnt[1]);
printf("   z: %i\n", work_grp_cnt[2]);
printf("\n");

int work_grp_size[3];
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, 0, &work_grp_size[0]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, 1, &work_grp_size[1]);
glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, 2, &work_grp_size[2]);
printf("Max work group sizes\n");
printf("   x: %i\n", work_grp_size[0]);
printf("   y: %i\n", work_grp_size[1]);
printf("   z: %i\n", work_grp_size[2]);

int work_grp_inv;
glGetIntegerv(GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, &work_grp_inv);
printf("Max invocations count per work group: %i\n", work_grp_inv);
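
Once the limits are known, a chosen local size can be checked against them before compiling the shader. The following is a sketch only: the `work_group_limits` struct and the function name are hypothetical, and the struct is assumed to be filled from the queries above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the queried limits. */
typedef struct {
    int max_size[3];     /* GL_MAX_COMPUTE_WORK_GROUP_SIZE per dimension */
    int max_invocations; /* GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS */
} work_group_limits;

/* Returns true if a local size of (x, y, z) respects both the
 * per-dimension limits and the limit on total invocations. */
bool work_group_size_is_valid(const work_group_limits *limits, int x, int y, int z)
{
    assert(limits != NULL);

    if (x < 1 || y < 1 || z < 1)
        return false;
    if (x > limits->max_size[0] || y > limits->max_size[1] || z > limits->max_size[2])
        return false;

    /* Compute the product in a wider type so it cannot overflow int. */
    long long total = (long long)x * y * z;
    return total <= limits->max_invocations;
}
```

Note that a size can pass every per-dimension check and still be rejected: with limits of (1024, 1024, 64) and 1024 invocations, a local size of (32, 32, 2) is invalid because it would need 2048 invocations.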

SIMT - The Execution Model

The parallel execution model of (compute) shaders is referred to as single instruction, multiple threads (SIMT). It combines SIMD with multithreading.

All threads (or invocations) of a wavefront execute their instructions in lock-step, meaning that in each step every thread processes the same instruction in synchronization with the other threads in the group.

Each GPU has cores, each core runs wavefronts, and each wavefront consists of threads.

This means that…

Data changes per thread
Code changes per wavefront
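
One consequence of code changing only per wavefront is that a branch costs more when the threads of a wavefront disagree on it. The C sketch below uses a deliberately simplified cost model, in which a divergent wavefront runs both sides of the branch one after the other with the inactive lanes masked out. Real hardware differs in detail, and the function name and cost units are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Cost in instruction steps of `if (cond) A else B` for one wavefront.
 * lane_takes_if[i] is the condition value of lane i. A side of the
 * branch is executed if at least one lane needs it. */
unsigned wavefront_branch_cost(const bool *lane_takes_if, size_t lane_count,
                               unsigned if_cost, unsigned else_cost)
{
    assert(lane_takes_if != NULL && lane_count > 0);

    bool any_if = false;
    bool any_else = false;
    for (size_t i = 0; i < lane_count; ++i) {
        if (lane_takes_if[i])
            any_if = true;
        else
            any_else = true;
    }

    unsigned cost = 0;
    if (any_if)
        cost += if_cost;   /* whole wavefront steps through A */
    if (any_else)
        cost += else_cost; /* whole wavefront steps through B */
    return cost;
}
```

With an if side of 10 steps and an else side of 20 steps, a wavefront whose lanes all take the if side costs 10 steps in this model, while a single disagreeing lane raises the cost to 30 steps for every lane.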

Performance Impact of `if` Statements

Three cases to distinguish:

Compile-Time Static if

The variable that the if branches on is actually a constant, so the compiler can optimize out the branching entirely.

Purely Dynamic if

Which branch is taken is determined at runtime. This can have a major performance impact, unless one branch is taken the vast majority of the time, so that the threads of a wavefront rarely diverge.

Uniform/Push-Constant based if

If statements that branch on uniforms (which are set before the draw call or dispatch is issued) can be considered constant for the duration of that call.
In theory, the driver can optimize for this case.

Branchless Programming

Take an if statement that is used just to set a variable to A or B:

if (condition) { color = red;  }
else           { color = blue; }

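This can be rewritten as follows, provided that condition is a number that is exactly 0 or 1 (a bool has to be converted first, e.g. with float(condition) in GLSL):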

color = red * condition + (1-condition) * blue;
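
The same arithmetic can be tried out on the CPU. The C sketch below applies it to a single color channel. The function name is hypothetical, and the assertion documents the assumption that the condition is exactly 0 or 1. In GLSL, the built-in mix(blue, red, condition) computes the same expression.

```c
#include <assert.h>

/* Returns a if condition == 1.0f and b if condition == 0.0f,
 * without branching on the condition. */
float select_branchless(float condition, float a, float b)
{
    assert(condition == 0.0f || condition == 1.0f);
    return a * condition + (1.0f - condition) * b;
}
```

Calling select_branchless(1.0f, red, blue) returns red and select_branchless(0.0f, red, blue) returns blue. Both products are always evaluated, so this trades a branch for a few extra arithmetic instructions.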

andersch.dev
