DirectCompute
From OpenVIDIA
(A Very Basic Introduction to) Programming DirectCompute & DirectX 11 Compute Shaders
This page introduces programming with DirectCompute. It illustrates DirectCompute concepts from the ground up and assumes no prior DirectX programming experience. Some knowledge of GPU-based data-parallel programming is helpful, namely the concepts of a kernel and of grids and thread blocks (in CUDA) or workgroups (in OpenCL). The page introduces DirectX 11 Compute Shaders, and its scope is intended to cover the topics relevant to using DirectCompute for general-purpose GPU computing. Please contact me regarding any errata, suggestions, or comments.
Introduction Sample Code
The sample program basically breaks down into the following steps to run a compute shader:
The following discussion goes into these parts in greater detail.
Device Management
DirectCompute is essentially fully introduced through the Compute Shader 5.0 programming model (CS 5.0). CS 5.0, however, requires DirectX 11 hardware. If DirectX 11 hardware isn't available (at the time of writing it is still just around the corner), DirectCompute can still be used in one of two ways: with the driver running in reference mode (software emulation), or in a "downlevel" configuration in which some of the compute shader functionality is exposed as "Compute Shader 4.0" and runs on DirectX 10 hardware. To do this:
The following code example calls D3D11CreateDevice...() multiple times, each time requesting a particular driver type (reference, i.e. software emulation, or hardware, i.e. the real GPU) and a list of acceptable feature levels (DX 10, 10.1, or 11).
D3D_FEATURE_LEVEL levelsWanted[] =
{
D3D_FEATURE_LEVEL_11_0,
D3D_FEATURE_LEVEL_10_1,
D3D_FEATURE_LEVEL_10_0
};
UINT numLevelsWanted = sizeof( levelsWanted ) / sizeof( levelsWanted[0] );
D3D_DRIVER_TYPE driverTypes[] =
{
D3D_DRIVER_TYPE_REFERENCE,
D3D_DRIVER_TYPE_HARDWARE,
};
UINT numDriverTypes = sizeof( driverTypes ) / sizeof( driverTypes[0] );
// Iterate through the driver types: try the reference driver first, then the
// hardware driver, and break on the first success.
// Change the order of the arrays above to try different configurations.
// Here, we take D3D 11 in reference mode to demonstrate the API.
for( UINT driverTypeIndex = 0; driverTypeIndex < numDriverTypes; driverTypeIndex++ )
{
    D3D_DRIVER_TYPE g_driverType = driverTypes[driverTypeIndex];
    UINT createDeviceFlags = 0;
    hr = D3D11CreateDevice( NULL, g_driverType, NULL, createDeviceFlags,
                            levelsWanted, numLevelsWanted, D3D11_SDK_VERSION,
                            &g_pD3DDevice, &g_D3DFeatureLevel, &g_pD3DContext );
    if( SUCCEEDED( hr ) )
        break; // stop at the first driver type that works
}
After completing successfully, the call provides a device pointer, a context pointer, and the feature level that was obtained. Note: for brevity, many of the variables and declarations are omitted; a full sample program should fill in the details. These code fragments should serve as a guide to what is occurring in the program.
Choosing which graphics card
Use the IDXGIFactory object to enumerate the adapters found on the system, as in the following. First the IDXGIFactory object is created, and then EnumAdapters is called with the integer index of the adapter being enumerated (the code below uses the IDXGIFactory1 and EnumAdapters1 variants). If no adapter exists at that index, the call returns DXGI_ERROR_NOT_FOUND.
// Enumerate all adapters, e.g. to find a CUDA capable one
std::vector<IDXGIAdapter1*> vAdapters;
IDXGIFactory1* factory;
CreateDXGIFactory1(__uuidof(IDXGIFactory1), (void**)&factory);
IDXGIAdapter1 * pAdapter = 0;
UINT i=0;
while(factory->EnumAdapters1(i, &pAdapter) != DXGI_ERROR_NOT_FOUND)
{
vAdapters.push_back(pAdapter);
++i;
}
Then, when creating the device, pass the desired adapter pointer as the first argument to D3D11CreateDevice and set the driver type to D3D_DRIVER_TYPE_UNKNOWN. See the D3D11 documentation from the D3D SDK for details on the D3D11CreateDevice function.
g_driverType = D3D_DRIVER_TYPE_UNKNOWN;
hr = D3D11CreateDevice( vAdapters[devNum], g_driverType, NULL, createDeviceFlags, levelsWanted,
numLevelsWanted, D3D11_SDK_VERSION, &g_pD3DDevice, &g_D3DFeatureLevel, &g_pD3DContext );
Running Compute Shaders
In DirectCompute, a compute shader is executed with the Dispatch() function, like:
// now dispatch ("run") the compute shader, with a set of 16x16 groups.
g_pD3DContext->Dispatch( 16, 16, 1 );
which dispatches a set of 16x16 thread groups. Note that the inputs to the shader are treated as "state": you set the state before dispatching the shader, and the state in effect at dispatch time determines the shader's input variables. So typically, dispatching a shader will have code that looks like:
pd3dImmediateContext->CSSetShader( ... );
pd3dImmediateContext->CSSetConstantBuffers( ...);
pd3dImmediateContext->CSSetShaderResources( ...); // CS input
// CS output
pd3dImmediateContext->CSSetUnorderedAccessViews( ...);
// Run the CS
pd3dImmediateContext->Dispatch( dimx, dimy, 1 );
and all the constant buffers, buffers, etc. seen in the shader are those set up by the CSSet...() calls before the Dispatch().
Synchronization with the CPU
Note that these calls are asynchronous. Each call appears to return to the CPU immediately, and the calls execute, in order, when possible. Later, "mapping" a buffer (see the buffers section below) will cause the calling CPU thread to block until all previous calls complete.
Events: Basic Profiling & Synchronization
DirectCompute provides an event API based on "queries". You can create, insert, and wait on queries to determine when your shader (or other asynchronous calls) has actually executed. The following example creates a query, waits on it to ensure everything up to that point has executed, dispatches a shader, then waits on the query again for the shader to complete.
Create the query object:
D3D11_QUERY_DESC pQueryDesc;
pQueryDesc.Query = D3D11_QUERY_EVENT;
pQueryDesc.MiscFlags = 0;
ID3D11Query *pEventQuery;
g_pD3DDevice->CreateQuery( &pQueryDesc, &pEventQuery );
Then, insert "fences" into the list of calls and wait on them. GetData() returns S_FALSE if the query's information is not yet available.
g_pD3DContext->End(pEventQuery); // insert a fence into the pushbuffer
while( g_pD3DContext->GetData( pEventQuery, NULL, 0, 0 ) == S_FALSE ) {} // spin until event is finished
g_pD3DContext->Dispatch( x, y, 1 ); // launch kernel
g_pD3DContext->End(pEventQuery); // insert a fence into the pushbuffer
while( g_pD3DContext->GetData( pEventQuery, NULL, 0, 0 ) == S_FALSE ) {} // spin until event is finished
Finally, the query object can be released with:
pEventQuery->Release();
Take care to create and release queries as necessary to avoid having too many queries floating around (for example, if you were creating one per frame).
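As a mental model only, the asynchronous submission and fence polling described above can be pictured with a toy command queue that lives entirely on the CPU. Nothing here is the D3D11 API: ToyContext, Submit, InsertFence, Poll, and RunFenceExample are hypothetical names, and the rule that each Poll retires one queued command is an assumption made so the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for a GPU command stream (hypothetical, not D3D11).
class ToyContext {
public:
    using Fence = std::shared_ptr<bool>;

    // Queue a command and return immediately; nothing runs yet.
    void Submit(std::function<void()> cmd) { pending_.push_back(std::move(cmd)); }

    // Queue a marker, loosely analogous to End(query) in the text.
    Fence InsertFence() {
        Fence reached = std::make_shared<bool>(false);
        pending_.push_back([reached] { *reached = true; });
        return reached;
    }

    // Loosely analogous to GetData(): each poll lets the "GPU" retire one
    // queued command (an assumption of this model), then reports whether
    // the fence has been reached.
    bool Poll(const Fence& fence) {
        if (!*fence && !pending_.empty()) {
            std::function<void()> cmd = std::move(pending_.front());
            pending_.pop_front();
            cmd();
        }
        return *fence;
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    std::deque<std::function<void()>> pending_;
};

// Submits three commands, inserts a fence, spins on it, and returns the
// order in which the commands ran.
std::vector<int> RunFenceExample() {
    std::vector<int> order;
    ToyContext ctx;
    for (int i = 0; i < 3; ++i)
        ctx.Submit([&order, i] { order.push_back(i); });
    ToyContext::Fence fence = ctx.InsertFence();
    while (!ctx.Poll(fence)) {}  // spin until the fence is reached
    return order;
}
```

In this model the fence is reached only after every command queued before it has run, in submission order, which is the property the spin loops in the text rely on.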
Resources in DirectCompute
In DirectX, resources are created by:
Finally, to "hook" them up to your compute shader, call the various CSSet...() functions to set their state; they will then be visible inside the shader as variables of the corresponding resource type.
CPU/GPU communication
Resources can be copied with the g_pD3DContext->CopyResource() function, which copies between two resources. To copy between the CPU and GPU, create a CPU-side "staging" resource. The staging resource can be mapped with a Map() call, which returns a CPU pointer that you can use to copy data into or read data from the staging resource. To upload, map the staging resource, write the data, Unmap() it, and then perform a CopyResource() to the GPU resource. To read back, perform the CopyResource() from the GPU resource first, then map the staging resource and read.
CPU/GPU Copying Buffer Performance
In CUDA C, pinned host memory and write-combined host memory can be allocated and provide the best performance when copying data from them to the GPU. In DirectCompute, the "usage" specifier of a buffer determines what kind of memory is allocated, and thus its copy performance as well.
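The staging round trip can be pictured with a small CPU-only model. This is a sketch under stated assumptions, not D3D11 code: ToyBuffer, ToyUsage, ToyMap, ToyCopyResource, Upload, and Readback are hypothetical stand-ins, the model knows only two usages, and the unmap step is omitted because it has no effect here.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical model of two buffer usages; real D3D11 has more.
enum class ToyUsage { Default, Staging };

struct ToyBuffer {
    ToyUsage usage;
    std::vector<unsigned char> bytes;
    ToyBuffer(ToyUsage u, std::size_t size) : usage(u), bytes(size) {}
};

// Stand-in for Map(): in this model only staging buffers give the CPU a pointer.
unsigned char* ToyMap(ToyBuffer& buffer) {
    if (buffer.usage != ToyUsage::Staging)
        throw std::logic_error("only staging buffers are CPU-mappable in this model");
    return buffer.bytes.data();
}

// Stand-in for CopyResource(): a whole-resource copy between equally sized buffers.
void ToyCopyResource(ToyBuffer& dst, const ToyBuffer& src) {
    if (dst.bytes.size() != src.bytes.size())
        throw std::invalid_argument("resources must have identical sizes");
    dst.bytes = src.bytes;
}

// Upload path: CPU data -> staging (via map) -> GPU resource (via copy).
void Upload(ToyBuffer& gpu, const std::vector<float>& data) {
    ToyBuffer staging(ToyUsage::Staging, data.size() * sizeof(float));
    std::memcpy(ToyMap(staging), data.data(), staging.bytes.size());
    ToyCopyResource(gpu, staging);
}

// Readback path: GPU resource -> staging (via copy) -> CPU data (via map).
std::vector<float> Readback(const ToyBuffer& gpu) {
    ToyBuffer staging(ToyUsage::Staging, gpu.bytes.size());
    ToyCopyResource(staging, gpu);
    std::vector<float> out(gpu.bytes.size() / sizeof(float));
    std::memcpy(out.data(), ToyMap(staging), out.size() * sizeof(float));
    return out;
}
```

The point of the model is the ordering: the copy comes after the map on upload and before it on readback, and the default-usage buffer is never touched directly by the CPU.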
See the D3D11 and D3D10 references on usage flags.
Structured Buffers and Unordered Access Views
One important feature of DirectCompute is structured buffers and unordered access views (UAVs). A structured buffer can be accessed like an array in a compute shader: any thread can read and write any location (i.e. scatter and gather). The unordered access view is the mechanism that binds a buffer created in the calling program to the shader and allows unordered access to it.
Declaring a Structured Buffer
To declare a structured buffer, use D3D11_RESOURCE_MISC_BUFFER_STRUCTURED. The bind flags below allow unordered access and access as a shader resource. Default usage is specified, meaning the buffer can be read and written by the GPU but must be copied to or from a staging resource if the CPU wants to read or write it.
// Create Structured Buffer
// D3DXVECTOR4 Declared in D3DX10Math.h
// http://msdn.microsoft.com/en-us/library/bb205130(VS.85).aspx
D3D11_BUFFER_DESC sbDesc;
sbDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
sbDesc.Usage = D3D11_USAGE_DEFAULT;
sbDesc.CPUAccessFlags = 0;
sbDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
sbDesc.StructureByteStride = sizeof(D3DXVECTOR4);
sbDesc.ByteWidth = sizeof(D3DXVECTOR4) * w * h;
hr = g_pD3DDevice->CreateBuffer( &sbDesc, NULL, &pStructuredBuffer );
Declaring an Unordered Access View
Here we declare an unordered access view. Note that we give it a pointer to the structured buffer.
// Create an Unordered Access View to the Structured Buffer
D3D11_UNORDERED_ACCESS_VIEW_DESC sbUAVDesc;
sbUAVDesc.Buffer.FirstElement =0;
sbUAVDesc.Buffer.Flags =0;
sbUAVDesc.Buffer.NumElements =w * h;
sbUAVDesc.Format =DXGI_FORMAT_UNKNOWN;
sbUAVDesc.ViewDimension =D3D11_UAV_DIMENSION_BUFFER;
hr = g_pD3DDevice->CreateUnorderedAccessView( pStructuredBuffer, &sbUAVDesc,
&g_pStructuredBufferUAV );
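The size fields in the two descriptions above have to agree with each other: the stride is the size of one element, the byte width is the stride times the element count, and the view's NumElements is that same count. A minimal sketch of that arithmetic follows. Element, ToyBufferSizes, and SizesFor are hypothetical names, and the multiple-of-4 check on the stride reflects my understanding of the D3D11 requirement rather than anything stated on this page.

```cpp
#include <cstdint>

// CPU-side mirror of one buffer element: four 32-bit values, the same
// 16 bytes as the D3DXVECTOR4 used in the text (hypothetical type).
struct Element {
    std::uint32_t color[4];
};

// The three size fields that must stay consistent (hypothetical type).
struct ToyBufferSizes {
    std::uint32_t structureByteStride;  // for D3D11_BUFFER_DESC
    std::uint32_t byteWidth;            // for D3D11_BUFFER_DESC
    std::uint32_t numElements;          // for the UAV description
};

// Derives all three fields from the element type and a w x h grid.
template <typename T>
ToyBufferSizes SizesFor(std::uint32_t w, std::uint32_t h) {
    // Assumption: structured buffer strides are expected to be a multiple of 4 bytes.
    static_assert(sizeof(T) % 4 == 0, "element size should be a multiple of 4 bytes");
    ToyBufferSizes sizes;
    sizes.structureByteStride = static_cast<std::uint32_t>(sizeof(T));
    sizes.numElements = w * h;
    sizes.byteWidth = sizes.structureByteStride * sizes.numElements;
    return sizes;
}
```

For example, SizesFor<Element>(64, 64) describes one element per thread for a Dispatch(16, 16, 1) of 4x4 thread groups.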
Later, before we dispatch the shader, we can activate the structured buffer for use by the shader:
g_pD3DContext->CSSetUnorderedAccessViews( 0, 1, &g_pStructuredBufferUAV, &initCounts );
After a dispatch, if using CS 4.x hardware, be sure to unbind it, since only a single UAV can be bound to a pipeline at once:
// D3D11 on D3D10 HW: only a single UAV can be bound to a pipeline at once.
// set to NULL to unbind
ID3D11UnorderedAccessView *pNullUAV = NULL;
g_pD3DContext->CSSetUnorderedAccessViews( 0, 1, &pNullUAV, &initCounts );
Constant Buffers in DirectCompute
A constant buffer is a set of data in the compute shader that doesn't change during an invocation. In graphics, this would be data such as viewing matrices or constant colors. In compute, it could be, say, filter weights for signal or image processing. To use a constant buffer:
Create a constant buffer. Note that, for the size, we know from the HLSL shader file that it's a 4-element vector.
// Create Constant Buffer
// D3DXVECTOR4 Declared in D3DX10Math.h
// http://msdn.microsoft.com/en-us/library/bb205130(VS.85).aspx
D3D11_BUFFER_DESC cbDesc;
cbDesc.BindFlags = D3D11_BIND_CONSTANT_BUFFER ;
cbDesc.Usage = D3D11_USAGE_DYNAMIC;
// CPU writable, should be updated per frame
cbDesc.CPUAccessFlags = D3D11_CPU_ACCESS_WRITE;
cbDesc.MiscFlags = 0;
cbDesc.ByteWidth = sizeof(D3DXVECTOR4) ;
hr = g_pD3DDevice->CreateBuffer( &cbDesc, NULL, &pConstantBuffer );
if( SUCCEEDED(hr) )
{
MessageBox(NULL,L"Created Constant Buffer",L"Created Buffer", MB_OK);
} else {
MessageBox( NULL, L"Failed Making Constant Buffer", L"Create Buffer", MB_OK );
}
Use the Map/Unmap interface to send data to the constant buffer. Often, programmers define a struct in the CPU program identical to the one in the HLSL program, use its sizeof for the buffer size, and then cast the mapped pointer to a pointer to that struct to fill in its data. Here, the four values are simply written through an unsigned int pointer.
// must use D3D11_MAP_WRITE_DISCARD
// http://msdn.microsoft.com/en-us/library/bb205318(VS.85).aspx
D3D11_MAPPED_SUBRESOURCE mappedResource;
g_pD3DContext->Map( pConstantBuffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mappedResource );
unsigned int *data = (unsigned int *)(mappedResource.pData);
for( int i = 0; i < 4; i++ )
    data[i] = i;
g_pD3DContext->Unmap( pConstantBuffer, 0 );
// now make the compute shader active
g_pD3DContext->CSSetShader( g_pComputeShader, NULL, 0 );
g_pD3DContext->CSSetConstantBuffers( 0, 1, &pConstantBuffer );
cbuffer consts {
uint4 const_color;
};
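The "identical struct on the CPU side" approach mentioned earlier can be sketched for the cbuffer above as follows. Consts and FillConsts are hypothetical names, and the `mapped` parameter stands in for mappedResource.pData so that the sketch compiles without the D3D headers. The 16-byte check reflects the D3D11 rule that a constant buffer's ByteWidth must be a multiple of 16.

```cpp
#include <cstdint>

// CPU-side mirror of: cbuffer consts { uint4 const_color; };
struct Consts {
    std::uint32_t const_color[4];
};

// Constant buffer sizes must be a multiple of 16 bytes, so check the mirror
// struct at compile time; sizeof(Consts) can then be used as cbDesc.ByteWidth.
static_assert(sizeof(Consts) % 16 == 0, "pad the struct to a multiple of 16 bytes");

// Fills the mapped memory through a typed pointer. `mapped` stands in for
// mappedResource.pData after a successful Map() with D3D11_MAP_WRITE_DISCARD.
void FillConsts(void* mapped, std::uint32_t x, std::uint32_t y,
                std::uint32_t z, std::uint32_t w) {
    Consts* consts = static_cast<Consts*>(mapped);
    consts->const_color[0] = x;
    consts->const_color[1] = y;
    consts->const_color[2] = z;
    consts->const_color[3] = w;
}
```

Calling FillConsts(mappedResource.pData, 0, 1, 2, 3) between Map() and Unmap() would write the same four values as the loop shown earlier, with the field names now matching the shader.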
The variable (const_color) is now available in your shader code, like:
uint4 color = uint4( groupID.x, groupID.y, const_color.x, const_color.z );
Multiple Constant Buffers
You can organize constants into different groups. Typically, if they all need to be updated at the same time, you would simply put them in one constant buffer:
cbuffer consts {
uint4 const_color_0;
uint4 const_color_1;
};
But additionally, you can have more than one constant buffer:
cbuffer consts {
uint4 const_color_0;
};
cbuffer more_consts {
uint4 const_color_1;
};
This is useful if, say, const_color_0 is updated on every invocation but const_color_1 is only updated every 100 invocations. To set them separately, create two buffers and use Map/Unmap as before. Then, when dispatching the compute shader, refer to each of them by its "slot". The slot number is the order in which they appear in the HLSL (unless a slot is assigned explicitly with register(bN)).
g_pD3DContext->CSSetConstantBuffers( 0, 1, &pConstantBuffer );
g_pD3DContext->CSSetConstantBuffers( 1, 1, &pVeryConstantBuffer );
g_pD3DContext->CSSetShader( g_pComputeShader, NULL, 0 );
Then, when dispatched, the compute shader pointed to by g_pComputeShader will be able to access both constant buffers.
Compute Shader (CS) HLSL Programming
A compute shader is programmed in HLSL. In our examples it lives in a separate text file that is loaded and compiled at runtime. The compute shader is a single program that is executed by many threads. These threads are grouped together into "thread groups", whose threads can share data and synchronize with each other. Many of these thread groups can be launched by a single Dispatch() call. For example, Dispatch(16,16,1) launches a set of 16x16x1 thread groups. The number of threads inside each group is determined inside the shader itself by the syntax:
[numthreads( 4, 4, 1)]
Placing the number of threads in a group (4,4 in this case) in a #define is handy, since it allows you to use the size inside the body of the shader code.
struct BufferStruct
{
uint4 color;
};
// group size
#define thread_group_size_x 4
#define thread_group_size_y 4
RWStructuredBuffer<BufferStruct> g_OutBuff;
/* This is the number of threads in a thread group, 4x4x1 in this example case */
// e.g.: [numthreads( 4, 4, 1 )]
[numthreads( thread_group_size_x, thread_group_size_y, 1 )]
void main( uint3 threadIDInGroup : SV_GroupThreadID, uint3 groupID : SV_GroupID, uint groupIndex : SV_GroupIndex, uint3 dispatchThreadID : SV_DispatchThreadID )
{
int N_THREAD_GROUPS_X = 16; // assumed equal to 16 in dispatch(16,16,1)
int stride = thread_group_size_x * N_THREAD_GROUPS_X;
// buffer stride, assumes data stride = data width (i.e. no padding)
int idx = dispatchThreadID.y * stride + dispatchThreadID.x;
// BufferStruct::color is a uint4, so build the value as a uint4
uint4 color = uint4(groupID.x, groupID.y, dispatchThreadID.x, dispatchThreadID.y);
g_OutBuff[ idx ].color = color;
}
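The index arithmetic in the shader above can be checked on the CPU. The following is a minimal sketch that assumes the same 4x4x1 group size. UInt3, kGroupSize, GroupsNeeded, DispatchThreadId, and FlatIndex are hypothetical helper names, not part of HLSL or D3D11. GroupsNeeded shows the usual round-up when the problem size is not a multiple of the group size, a case the shader above does not handle.

```cpp
#include <cstdint>

struct UInt3 {
    std::uint32_t x, y, z;
};

// Thread-group size, matching [numthreads( 4, 4, 1 )] in the shader.
constexpr UInt3 kGroupSize = {4, 4, 1};

// Number of thread groups needed to cover `count` items along one axis,
// rounding up so that a partial group still gets dispatched.
std::uint32_t GroupsNeeded(std::uint32_t count, std::uint32_t groupSize) {
    return (count + groupSize - 1) / groupSize;
}

// SV_DispatchThreadID = SV_GroupID * group size + SV_GroupThreadID.
UInt3 DispatchThreadId(UInt3 groupId, UInt3 threadInGroup) {
    return UInt3{groupId.x * kGroupSize.x + threadInGroup.x,
                 groupId.y * kGroupSize.y + threadInGroup.y,
                 groupId.z * kGroupSize.z + threadInGroup.z};
}

// Same flat index as the shader: row * stride + column, where the stride is
// the total number of threads along x (group size times number of groups).
std::uint32_t FlatIndex(UInt3 dispatchThreadId, std::uint32_t groupsX) {
    std::uint32_t stride = kGroupSize.x * groupsX;
    return dispatchThreadId.y * stride + dispatchThreadId.x;
}
```

With Dispatch(16, 16, 1), the last thread of the last group has dispatch ID (63, 63) and writes element 63 * 64 + 63 = 4095, the final element of a 64x64 buffer.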
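All threads execute a single function. Each thread has its own unique thread ID within a group, and each group has its own ID. These IDs can be used to determine which part of an array to operate on, for instance. So by launching as many threads as you have input array elements, you can have each thread operate on a single element in parallel. In particular, the following are available as arguments to your shader entry function (as used in the example above):
SV_GroupThreadID: the thread's ID within its thread group
SV_GroupID: the ID of the thread group within the dispatch
SV_GroupIndex: the flattened (1D) index of the thread within its thread group
SV_DispatchThreadID: the thread's ID within the whole dispatch (group ID times group size, plus the ID within the group)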
Questions/Feedback
Until I get the discussion function worked out, any discussion about the page can take place on the OpenVIDIA Forums. For most programming questions, your best bet is to ask on the forums listed below.
Additional Reading