Compute Shader
Compute shaders are shader programs provided by a rendering API that run outside the normal graphics pipeline and can therefore perform arbitrary computations. This lets a programmer use the parallel nature of graphics cards for general computing purposes.
They are a form of compute kernel and enable general-purpose computing on graphics processing units (GPGPU). In that vein, they serve a similar purpose to OpenCL or CUDA.
OpenGL Compute Shaders
Compute shaders were introduced in OpenGL 4.3.
Example of a compute shader in GLSL.
    #version 430 core

    /* invocations per work group */
    layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;

    void main() {
        /* compute shader base inputs */
        uvec3 num_work_groups = gl_NumWorkGroups; // group counts passed to glDispatchCompute
        uvec3 current_work_group_id = gl_WorkGroupID;
        uvec3 current_local_invocation_id = gl_LocalInvocationID;

        /* derived inputs */
        // == gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
        uvec3 current_global_invocation_id = gl_GlobalInvocationID;
        // 3d gl_LocalInvocationID flattened to 1d index
        uint current_local_invocation_idx = gl_LocalInvocationIndex;

        /* for workgroup sizes of 1, gl_WorkGroupID == gl_GlobalInvocationID */
    }
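The derived inputs follow directly from the base inputs. The sketch below reproduces the two formulas on the CPU in plain C so they can be checked by hand; the `uvec3` struct and both function names are stand-ins made up for this illustration and are not part of any OpenGL API.

```c
#include <assert.h>

/* Plain-C stand-in for GLSL's uvec3 (hypothetical helper type). */
typedef struct { unsigned x, y, z; } uvec3;

/* gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID */
uvec3 global_invocation_id(uvec3 group_id, uvec3 group_size, uvec3 local_id)
{
    /* A local ID always lies inside its work group. */
    assert(local_id.x < group_size.x);
    assert(local_id.y < group_size.y);
    assert(local_id.z < group_size.z);

    uvec3 g = {
        group_id.x * group_size.x + local_id.x,
        group_id.y * group_size.y + local_id.y,
        group_id.z * group_size.z + local_id.z
    };
    return g;
}

/* gl_LocalInvocationIndex: the 3D local ID flattened to 1D, x varying fastest. */
unsigned local_invocation_index(uvec3 group_size, uvec3 local_id)
{
    return local_id.z * group_size.x * group_size.y
         + local_id.y * group_size.x
         + local_id.x;
}
```

For example, with a work group size of (8, 4, 1), work group (2, 1, 0) and local ID (3, 2, 0), the global ID is (19, 6, 0) and the flattened local index is 19.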
Usage code of a compute shader program:
    glUseProgram(compute_shader);
    // number of work groups to be launched for every dimension
    glDispatchCompute(num_groups_x, num_groups_y, num_groups_z);
    // make the shader's memory writes visible to subsequent commands
    glMemoryBarrier(GL_ALL_BARRIER_BITS);
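The group counts passed to glDispatchCompute are usually derived from the amount of data to process. A minimal sketch of that calculation for one dimension, assuming a hypothetical helper name `group_count` and a local size that matches the shader's `local_size_x`:

```c
#include <assert.h>

/* Smallest number of work groups such that groups * local_size >= item_count.
   Assumes item_count + local_size - 1 does not overflow an unsigned int. */
unsigned group_count(unsigned item_count, unsigned local_size)
{
    assert(local_size > 0);
    return (item_count + local_size - 1) / local_size; /* integer ceiling division */
}
```

With 1000 items and a local size of 64 this yields 16 groups, i.e. 1024 invocations. The 24 surplus invocations in the last group would have to be skipped inside the shader, for instance by comparing gl_GlobalInvocationID.x against the item count.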
Warp/Wavefront (Work Group Sizes)
GPUs schedule threads in fixed-size groups called warps (NVIDIA) or wavefronts (AMD), and each GPU is optimized for a certain workgroup size. The number of invocations in a workgroup (local_size_x * local_size_y * local_size_z) should be a multiple of the number of threads in a warp, which is often 32.
Common warp/wavefront sizes:
- NVIDIA: 32
- AMD: 64 (32 is also used on RDNA GPUs)
- Intel: 32
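Why a multiple of the warp size matters can be shown with a little arithmetic. The sketch below assumes a simplified model in which a work group is split into whole warps and the unused lanes of the last warp stay idle; the function names are made up for this illustration.

```c
#include <assert.h>

/* Warps (wavefronts) needed to hold one work group of `invocations` threads. */
unsigned warps_per_group(unsigned invocations, unsigned warp_size)
{
    assert(warp_size > 0);
    return (invocations + warp_size - 1) / warp_size;
}

/* Lanes that stay idle because the last warp is only partially filled. */
unsigned idle_lanes(unsigned invocations, unsigned warp_size)
{
    return warps_per_group(invocations, warp_size) * warp_size - invocations;
}
```

Under this model, a work group of 48 invocations on a 32-wide warp occupies 2 warps and leaves 16 lanes idle, while a work group of 64 invocations leaves none idle.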
Querying work group limits at runtime (these are implementation limits on work group counts and sizes, not the warp size, which core OpenGL does not expose):
    // Requires <stdio.h> and a current OpenGL 4.3+ context.
    int work_grp_cnt[3];
    glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 0, &work_grp_cnt[0]);
    glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 1, &work_grp_cnt[1]);
    glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_COUNT, 2, &work_grp_cnt[2]);
    printf("Max work groups per compute shader\n");
    printf(" x: %i\n", work_grp_cnt[0]);
    printf(" y: %i\n", work_grp_cnt[1]);
    printf(" z: %i\n", work_grp_cnt[2]);
    printf("\n");

    int work_grp_size[3];
    glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, 0, &work_grp_size[0]);
    glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, 1, &work_grp_size[1]);
    glGetIntegeri_v(GL_MAX_COMPUTE_WORK_GROUP_SIZE, 2, &work_grp_size[2]);
    printf("Max work group sizes\n");
    printf(" x: %i\n", work_grp_size[0]);
    printf(" y: %i\n", work_grp_size[1]);
    printf(" z: %i\n", work_grp_size[2]);

    int work_grp_inv;
    glGetIntegerv(GL_MAX_COMPUTE_WORK_GROUP_INVOCATIONS, &work_grp_inv);
    printf("Max invocations count per work group: %i\n", work_grp_inv);
SIMT - The Execution Model
The parallel execution model of (compute) shaders is referred to as single instruction, multiple threads (SIMT). It combines SIMD (single instruction, multiple data) with multithreading.
All instructions in all threads (or invocations) of a wavefront are executed in lock-step, meaning they process exactly one instruction per step in synchronization with the other threads in the group.
Each GPU has cores, each core runs wavefronts, and each wavefront consists of threads.
This means that…
- Data can differ per thread
- Code (the instruction currently being executed) can differ only per wavefront
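A toy CPU model can make the lock-step idea concrete. The sketch below is only an illustration under simplifying assumptions: a wavefront of four lanes, a made-up two-instruction set (`OP_ADD`, `OP_MUL`), and one register per lane. All names are hypothetical and nothing here corresponds to a real GPU instruction set.

```c
#include <assert.h>

#define WAVE_SIZE 4 /* tiny wavefront for illustration */

typedef enum { OP_ADD, OP_MUL } op_t;

/* One lock-step step: every lane executes the same instruction,
   but each lane applies it to its own register (its own data). */
void wave_step(op_t op, float operand, float regs[WAVE_SIZE])
{
    for (int lane = 0; lane < WAVE_SIZE; ++lane) {
        if (op == OP_ADD)
            regs[lane] += operand;
        else
            regs[lane] *= operand;
    }
}

/* Run one instruction stream that is shared by the whole wavefront. */
void wave_run(const op_t *ops, const float *operands, int count,
              float regs[WAVE_SIZE])
{
    assert(count >= 0);
    for (int pc = 0; pc < count; ++pc)
        wave_step(ops[pc], operands[pc], regs);
}
```

Running the program "add 1, then multiply by 2" on the lane registers 1, 2, 3, 4 leaves 4, 6, 8, 10 in them: the code was the same for every lane, only the data differed.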
Performance Impact of if Statements
Three cases to distinguish:
Compile-Time Static if
- The variable that the if branches on is actually a constant, hence the compiler is able to optimize out the branching entirely
Purely Dynamic if
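- Which branch is taken is determined at runtime. If threads of the same wavefront take different branches, the wavefront has to execute both sides one after the other with the non-participating threads masked out, which can have a major performance impact. The cost is small when all threads of a wavefront take the same branch, which becomes more likely the more often one branch is taken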
Uniform/Push-Constant based if
- if statements that branch on uniforms (which are set before the draw or dispatch call is issued) can be considered constant for the duration of that call, so all threads take the same branch
- In theory, the driver can optimize for this case
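The difference between the cases can be estimated with a simple cost model. The sketch below assumes that a wavefront pays the full length of a branch side as soon as at least one of its lanes takes that side; real hardware is more nuanced, and the function name and parameters are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Steps a wavefront spends on an if/else under a simple masking model:
   each side of the branch costs its full length if any lane takes it. */
unsigned branch_steps(const bool *takes_then, unsigned lanes,
                      unsigned then_len, unsigned else_len)
{
    assert(lanes > 0);
    bool any_then = false;
    bool any_else = false;
    for (unsigned i = 0; i < lanes; ++i) {
        if (takes_then[i])
            any_then = true;
        else
            any_else = true;
    }
    return (any_then ? then_len : 0) + (any_else ? else_len : 0);
}
```

With a 10-step then-side and a 20-step else-side, a wavefront whose lanes all take the then-side spends 10 steps, which is what a uniform-based if behaves like. A single lane taking the else-side raises the cost to 30 steps for the whole wavefront.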
Branchless Programming
Take an if statement that is used just to set a variable to A or B:
if (condition) { color = red; } else { color = blue; }
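This can be rewritten without a branch, provided the condition is first converted to a number that is 1.0 when true and 0.0 when false:
    float t = float(condition);
    color = red * t + (1.0 - t) * blue;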
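The same rewrite can be tried out on the CPU. The following plain-C sketch uses scalar floats instead of colors; the function names are hypothetical and only serve to compare the branching and the branchless form.

```c
#include <assert.h>

/* Branching version, for reference. */
float select_branch(int condition, float a, float b)
{
    if (condition)
        return a;
    return b;
}

/* Branchless version: t is 1.0f when condition is nonzero, 0.0f otherwise. */
float select_branchless(int condition, float a, float b)
{
    float t = (float)(condition != 0);
    return a * t + b * (1.0f - t);
}

/* Both versions agree for ordinary finite inputs. */
void check_select(void)
{
    assert(select_branch(1, 2.0f, 5.0f) == select_branchless(1, 2.0f, 5.0f));
    assert(select_branch(0, 2.0f, 5.0f) == select_branchless(0, 2.0f, 5.0f));
}
```

The arithmetic form is not a drop-in replacement in every case: if the value that is not selected is infinite or NaN, multiplying it by zero produces NaN, whereas the branching version simply ignores it. Whether the branchless form is actually faster depends on the compiler and hardware, since compilers may already turn such a simple if into a select operation.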