From OpenVIDIA


(A Very Basic Introduction to) Programming DirectCompute & DirectX 11 Compute Shaders

This page introduces programming with DirectCompute. It is intended to illustrate concepts in DirectCompute programming from the ground up, without assuming any prior DirectX programming experience. Some knowledge of GPU-based data-parallel programming is helpful, namely the concepts of a kernel and of thread blocks (in CUDA) or work-groups (in OpenCL). It introduces DirectX 11 Compute Shaders. Hopefully the scope covers the topics relevant to using DirectCompute for general-purpose GPU compute.

Please contact me regarding any errata or suggestions & comments: email me

  • 5-19-2010: MSDN's D3D11 Reference is online now, adding better links for material in the text
  • 4-10-2010: Added notes on what the sample program does
  • 9-15-2009: Currently under development, this page is incomplete but hopefully is a good complementary resource to other sources on the web.

Introduction Sample Code

  • Full sample code This link contains the full sample code (.cpp) for a minimal console DirectCompute program. The example is console based and doesn't use any graphics or windowing.
    • This program shows a round trip of data to the GPU, through a shader, and back to CPU memory. The output is an element (4 components) of a buffer written by the compute shader, containing the grid location and data from a constant buffer set by the CPU.
  • Compute Shader code This link contains the full shader code (.hlsl)

The sample program basically breaks down into the following steps to run a compute shader:

  1. Initialize the device and context
  2. Load the shader from an HLSL file and compile it
  3. Create and initialize resources (buffers etc) for the shader
  4. Set shader state, and execute
  5. Read back the results.

The following discussion goes into these parts in greater detail.

Device Management

DirectCompute is fully exposed through the Compute Shader 5.0 programming model (aka CS 5.0), which requires DirectX 11. If DirectX 11 hardware isn't available (say at the time of writing, when DirectX 11 hardware is still just around the corner), DirectCompute can still be used in two ways: with the driver running in reference mode (a software implementation), or in a "downlevel" configuration where a subset of compute shader functionality, called "Compute Shader 4.0", runs on DirectX 10 hardware. To do this:

  • Write your program using the DirectX 11 API (i.e. ID3D11... calls)
  • Create a DX11 device, but request a DX10 (CS 4.0) feature level when creating it.

The following code example calls D3D11CreateDevice() multiple times, trying each time to get a particular driver type (reference or hardware, i.e. software emulation or the real GPU) and feature level (DX10, 10.1, or 11).

   D3D_FEATURE_LEVEL levelsWanted[] = 
   { 
       D3D_FEATURE_LEVEL_11_0, D3D_FEATURE_LEVEL_10_1, D3D_FEATURE_LEVEL_10_0 
   };
   UINT numLevelsWanted = sizeof( levelsWanted ) / sizeof( levelsWanted[0] );

   D3D_DRIVER_TYPE driverTypes[] = 
   { 
       D3D_DRIVER_TYPE_REFERENCE, D3D_DRIVER_TYPE_HARDWARE 
   };
   UINT numDriverTypes = sizeof( driverTypes ) / sizeof( driverTypes[0] );

   // iterate through driver types, try reference driver type first, then software driver 
   // break on the first success.
   // change the orders above to try different configurations
   // here, we take D3D 11 in reference mode to demonstrate the API
   for( UINT driverTypeIndex = 0; driverTypeIndex < numDriverTypes; driverTypeIndex++ )
   {
       D3D_DRIVER_TYPE g_driverType = driverTypes[driverTypeIndex];
       UINT createDeviceFlags = 0;
       hr = D3D11CreateDevice( NULL, g_driverType, NULL, createDeviceFlags, 
           levelsWanted, numLevelsWanted, D3D11_SDK_VERSION, 
           &g_pD3DDevice, &g_D3DFeatureLevel, &g_pD3DContext );
       if( SUCCEEDED( hr ) )
           break;
   }

On success, this provides a device pointer, a context pointer, and the feature level that was obtained.

Note: for brevity, many of the variables and declarations are omitted; a full sample program should fill in the details. These code fragments should serve as a guide to what's occurring in the program.

Choosing which graphics card

Use the IDXGIFactory object to enumerate the number of adapters found on the system, as in the following. First the IDXGIFactory object is created, and then EnumAdapters is called with an integer argument of the adapter being enumerated. If it doesn't exist, it will return DXGI_ERROR_NOT_FOUND.

   // Enumerate the adapters on the system
   std::vector<IDXGIAdapter1*> vAdapters;
   IDXGIFactory1* factory;
   CreateDXGIFactory1(__uuidof(IDXGIFactory1), (void**)&factory);
   IDXGIAdapter1 * pAdapter = 0;
   UINT i=0;
   while(factory->EnumAdapters1(i, &pAdapter) != DXGI_ERROR_NOT_FOUND) 
   {
       vAdapters.push_back(pAdapter);
       ++i;
   }

Then, when creating the device, pass the desired adapter pointer as the first argument to D3D11CreateDevice and set the driver type to D3D_DRIVER_TYPE_UNKNOWN. See the D3D11 documentation from the DirectX SDK for details on the D3D11CreateDevice function.

   g_driverType = D3D_DRIVER_TYPE_UNKNOWN;
   hr = D3D11CreateDevice( vAdapters[devNum], g_driverType, NULL, createDeviceFlags, levelsWanted, 
               numLevelsWanted, D3D11_SDK_VERSION, &g_pD3DDevice, &g_D3DFeatureLevel, &g_pD3DContext );

Running Compute Shaders

In DirectCompute, a compute shader is executed with the Dispatch() call:

	// now dispatch ("run") the compute shader, with a set of 16x16 groups.
	g_pD3DContext->Dispatch( 16, 16, 1 );

which dispatches a set of 16x16 thread groups.

Note that the inputs to the shader are considered as "state". That is, you set the state before dispatching the shader, and once dispatched, the "state" determines the input variables. So typically, dispatching a shader will have code that looks like:

    pd3dImmediateContext->CSSetShader( ... );
    pd3dImmediateContext->CSSetConstantBuffers( ...);
    pd3dImmediateContext->CSSetShaderResources( ...);  // CS input
    // CS output
    pd3dImmediateContext->CSSetUnorderedAccessViews( ...);
    // Run the CS
    pd3dImmediateContext->Dispatch( dimx, dimy, 1 );

and all the constant buffers, buffers, etc. seen in the shader are those set up by the CSSet...() calls before the Dispatch().

Synchronization with the CPU

Note that these calls are asynchronous. They return immediately on the CPU, and execute on the GPU, in order, when possible. Later, "mapping" a buffer (see the buffers section below) will cause the calling CPU thread to block until all previous calls complete.

If you just want to know when a kernel has finished running, e.g. for basic profiling, use an event query:

Events: Basic Profiling & Synchronization

DirectCompute provides an event API based on "queries". You can create, insert, and wait on query states to determine when your shader (or other asynchronous calls) has actually executed. The following example creates a query, waits on it to ensure everything up to that point has executed, dispatches a shader, then waits again for the shader to complete.

Create the query object:

 D3D11_QUERY_DESC pQueryDesc;
 pQueryDesc.Query = D3D11_QUERY_EVENT;
 pQueryDesc.MiscFlags = 0;
 ID3D11Query *pEventQuery;
 g_pD3DDevice->CreateQuery( &pQueryDesc, &pEventQuery );

Then, insert "fences" into the command stream and wait on them. GetData() returns S_FALSE if the query's information is not yet available.

 g_pD3DContext->End(pEventQuery); // insert a fence into the pushbuffer
 while( g_pD3DContext->GetData( pEventQuery, NULL, 0, 0 ) == S_FALSE ) {} // spin until event is finished
 g_pD3DContext->Dispatch(x, y, 1); // launch kernel
 g_pD3DContext->End(pEventQuery); // insert a fence into the pushbuffer
 while( g_pD3DContext->GetData( pEventQuery, NULL, 0, 0 ) == S_FALSE ) {} // spin until event is finished

Finally, the query object can be released with:

 pEventQuery->Release();
Take care to initialize the query and release it as necessary to avoid having too many queries floating around (such as if you were processing per frame).

  • D3D11 Queries (MSDN)
  • MSDN D3D9 Reference on Queries

Resources in DirectCompute

In DirectX, resources are created by:

  1. First create a resource descriptor and set its fields to describe the resource you want. The resource descriptor is a structure containing flags and information about the resource to be created.
  2. Then call the corresponding Create...() function, which takes this descriptor as an argument and creates the resource.

Finally, to "hook" resources up to your compute shader, call the various CSSet...() functions to set their state, and they will be visible inside the shader as the corresponding variables of the same resource type.

CPU/GPU communication

Resources can be copied to and read from using the g_pD3DContext->CopyResource() function, which copies between two resources. To copy between the CPU and GPU, create a CPU-side "staging" resource. The staging resource can be mapped with a Map() call, which returns a CPU pointer that you can use to copy data into or read data from the staging resource. Then Unmap() the staging resource, and perform a CopyResource() to or from the GPU resource.

CPU/GPU Copying Buffer Performance

In CUDA-C, pinned host memory and write combined host memory can be allocated and provide the best performance when copying data from them to the GPU. In DirectCompute, the "usage" specifiers of the buffer determine what kind of memory is allocated, and thus its performance as well.

  • D3D11_USAGE_STAGING. These resources are in system memory, so they can be read/written directly by the CPU. But they’re only allowed to be used as the src/dest of a copy (CopyResource(), CopySubresourceRegion()); they can’t be accessed directly by a shader.
    • If the resource is created with the D3D11_CPU_ACCESS_WRITE flag it will provide good CPU->GPU performance;
    • If you use D3D11_CPU_ACCESS_READ it’ll be CPU-cached with lower performance (but allows for readback)
    • READ takes precedence over WRITE if you use both.
  • D3D11_USAGE_DYNAMIC (for Buffer resources only, not Texture* resources). Use for fast CPU->GPU memory transfers. They can be used as copy src/dest, and can also be read as textures (“ShaderResourceViews” in D3D-speak) from shaders. They can’t be written from a shader. They’re versioned by the driver -- each time you Map() with the DISCARD flag the driver will return a new chunk of memory if the previous version is still in use by the GPU, rather than stalling until the GPU finishes. They’re meant for streaming data to the GPU.

  • D3D11 Reference on usage flags
  • D3D10 reference on usage flags

Structured Buffers and Unordered Access Views

One important feature in DirectCompute is the structured buffer, together with the unordered access view (UAV). A structured buffer can be accessed like an array in a compute shader: any thread can read and write any location (aka scatter & gather). The unordered access view is the mechanism that binds the structured buffer declared in the calling program to the shader and allows this unordered access.

Declaring a Structured Buffer

To declare a structured buffer, use the D3D11_RESOURCE_MISC_BUFFER_STRUCTURED misc flag. The bind flags below allow unordered access and access from the shader. Default usage is specified, meaning the buffer can be read/written by the GPU, but data will need to be copied to/from a staging resource if the CPU wants to read or write it.

	// Create Structured Buffer
	// D3DXVECTOR4 Declared in D3DX10Math.h
	D3D11_BUFFER_DESC sbDesc;
	sbDesc.BindFlags	=D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
	sbDesc.Usage	=D3D11_USAGE_DEFAULT;
	sbDesc.CPUAccessFlags	=0;
	sbDesc.MiscFlags	=D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
	sbDesc.StructureByteStride	=sizeof(D3DXVECTOR4);
	sbDesc.ByteWidth		=sizeof(D3DXVECTOR4) * w * h;

	hr = g_pD3DDevice->CreateBuffer( &sbDesc, NULL, &pStructuredBuffer );

Declaring an Unordered Access View

Here we declare an unordered access view. Note that we give it a pointer to the structured buffer.

   // Create an Unordered Access View to the Structured Buffer 
   D3D11_UNORDERED_ACCESS_VIEW_DESC sbUAVDesc;
   sbUAVDesc.Format    =DXGI_FORMAT_UNKNOWN; // must be UNKNOWN for structured buffers
   sbUAVDesc.Buffer.FirstElement    =0;
   sbUAVDesc.Buffer.Flags    =0;
   sbUAVDesc.Buffer.NumElements    =w * h;
   sbUAVDesc.ViewDimension	=D3D11_UAV_DIMENSION_BUFFER; 
   hr = g_pD3DDevice->CreateUnorderedAccessView( pStructuredBuffer, &sbUAVDesc,  
       &g_pStructuredBufferUAV );

Later, before we dispatch the shader, we can activate the structured buffer for use by the shader (the initCounts argument only matters for append/consume buffers):

 g_pD3DContext->CSSetUnorderedAccessViews( 0, 1, &g_pStructuredBufferUAV, &initCounts );

After a dispatch, if using CS 4.x hardware, be sure to unbind it, since only a single UAV can be bound to the pipeline at once:

 // D3D11 on D3D10 hW: only a single UAV can be bound to a pipeline at once. 
 // set to NULL to unbind
 ID3D11UnorderedAccessView *pNullUAV = NULL;
 g_pD3DContext->CSSetUnorderedAccessViews( 0, 1, &pNullUAV, &initCounts );

Constant Buffers in DirectCompute

A constant buffer holds data that doesn't change during a shader invocation. In graphics, this would be data like viewing matrices or constant colors; in compute, it could be, say, filter weights for signal or image processing.

To use a constant buffer,

  • create the buffer resource
  • initialize the data using map/unmap (or effects interface)
  • set the constant buffer state using CSSetConstantBuffers

Create a constant buffer. Note that, for the size, we know from the HLSL shader file that it's a 4-element vector.

	// Create Constant Buffer
	// D3DXVECTOR4 Declared in D3DX10Math.h
	D3D11_BUFFER_DESC cbDesc;
	cbDesc.BindFlags		=	D3D11_BIND_CONSTANT_BUFFER ;
	cbDesc.Usage		=	D3D11_USAGE_DYNAMIC;    
                // CPU writable, should be updated per frame
	cbDesc.CPUAccessFlags	=	D3D11_CPU_ACCESS_WRITE;
	cbDesc.MiscFlags		=	0;
	cbDesc.ByteWidth		=	sizeof(D3DXVECTOR4) ;

	hr = g_pD3DDevice->CreateBuffer( &cbDesc, NULL, &pConstantBuffer );
	if( SUCCEEDED(hr) ) {
		MessageBox( NULL, L"Created Constant Buffer", L"Created Buffer", MB_OK );
	} else {
		MessageBox( NULL, L"Failed Making Constant Buffer", L"Create Buffer", MB_OK );
	}
Use the map/unmap interface to send data to the constant buffer. Often, programmers will define a struct in the CPU program identical to the one in the HLSL program, use its sizeof for the buffer size, and then map the buffer to a pointer to that struct to fill in its data.

	// must use D3D11_MAP_WRITE_DISCARD
	D3D11_MAPPED_SUBRESOURCE mappedResource;
	g_pD3DContext->Map( pConstantBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource );
	unsigned int *data = (unsigned int *)(mappedResource.pData);
	for( int i=0 ; i<4; i++ ) data[i] = i;
	g_pD3DContext->Unmap( pConstantBuffer, 0 );

Note that the input variables (in this case constant buffers) to the compute shader are treated like "state" variables, and so before dispatching a compute shader, set up the state using the CSSet...() functions and when the compute shader executes it will utilize the variables that have been set.

	// now make the compute shader active
	g_pD3DContext->CSSetShader( g_pComputeShader, NULL, 0 );
	g_pD3DContext->CSSetConstantBuffers( 0 ,1,  &pConstantBuffer );

Then, in the shader, declare a constant buffer variable like:

cbuffer consts {
	uint4 const_color;
};

The variable (const_color) is now available in your shader code like:

	uint4 color = uint4( groupID.x, groupID.y , const_color.x, const_color.z );

Multiple Constant Buffers

You can make different groups of constant buffers. Typically, if they all need to be updated at the same time, you would simply do something like:

cbuffer consts {
    uint4 const_color_0;
    uint4 const_color_1;
};

But additionally, you can have more than one constant buffer:

cbuffer consts {
  uint4 const_color_0;
};
cbuffer more_consts {
  uint4 const_color_1;
};

This is useful if, say, const_color_0 is updated on each invocation but const_color_1 is only updated every 100 invocations.

To set them differently, create buffers and use the map/unmap as before. Then, when dispatching the compute shader, refer to each of them by their "slot". Their "slot" number is the order in which they appear in the HLSL.

 g_pD3DContext->CSSetConstantBuffers( 0 ,1,  &pConstantBuffer );
 g_pD3DContext->CSSetConstantBuffers( 1 ,1,  &pVeryConstantBuffer );
 g_pD3DContext->CSSetShader( g_pComputeShader, NULL, 0 );

Then, when dispatched, the compute shader pointed to by object g_pComputeShader will be able to access two constant buffers.

Compute Shader (CS) HLSL Programming

A compute shader is programmed in HLSL. In our examples it lives in a separate text file that is loaded and compiled at runtime. The compute shader is a single program that is executed by many threads. These threads are grouped together into "thread groups", within which threads can share data and synchronize with each other.

Many of these thread groups can be launched by the Dispatch() call. For example

 g_pD3DContext->Dispatch( 16, 16, 1 );

launches a set of 16x16x1 thread groups.

The number of threads inside each group is determined inside the shader itself by the syntax:

 [numthreads( 4, 4, 1)]

Placing the number of threads in a group (4,4,1 in this case) in a #define is handy, since it allows you to use the group size inside the body of the shader code. [1]

 struct BufferStruct
 {
 	uint4 color;
 };
 // group size
 #define thread_group_size_x 4
 #define thread_group_size_y 4
 RWStructuredBuffer<BufferStruct> g_OutBuff;
 /* This is the number of threads in a thread group, 4x4x1 in this example case */
 // e.g.: [numthreads( 4, 4, 1 )]
 [numthreads( thread_group_size_x, thread_group_size_y, 1 )]
 void main( uint3 threadIDInGroup : SV_GroupThreadID, uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex, uint3 dispatchThreadID : SV_DispatchThreadID )
 {
    int N_THREAD_GROUPS_X = 16;  // assumed equal to 16, as in Dispatch(16,16,1)
    int stride = thread_group_size_x * N_THREAD_GROUPS_X;
    // buffer stride, assumes data stride = data width (i.e. no padding)
    int idx = dispatchThreadID.y * stride + dispatchThreadID.x;
    float4 color = float4(groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y);
    g_OutBuff[ idx ].color = color;
 }

All threads execute a single function. Each thread has its own unique thread ID within a group and each group has its own ID. These IDs can be used to determine which part of an array to operate on for instance. So by launching as many threads as you have input array elements, you can have each thread operate on a single element in parallel.

In particular, the following are available as arguments to your shader entry function:

  • uint3 threadIDInGroup : SV_GroupThreadID (ID of the thread within its group, in each dimension)
  • uint3 groupID : SV_GroupID (ID of the group, in each dimension of the dispatch)
  • uint groupIndex : SV_GroupIndex (flattened ID of the thread within its group, counted in raster order)
  • uint3 dispatchThreadID : SV_DispatchThreadID (ID of the thread within the entire dispatch, in each dimension)


Until I get the discussion function worked out, any discussion about the page can take place on the OpenVIDIA Forums

For most programming questions, your best bet is to ask on the forums listed below.

Additional Reading

This page was last modified on 19 May 2010, at 22:37.