Shawn Hargreaves Blog
This seemingly innocuous code:
is the sort of thing that makes graphics driver writers wake in the night, heart pounding, drenched in sweat. Verily 'tis the stuff of nightmares.
The problem is that the GPU runs asynchronously later than the CPU. When the CPU reaches the SetData call and wants to change the contents of the vertex buffer (or index buffer, or texture...) the GPU hasn't yet got around to processing the earlier draw call, so it still needs the previous contents of that buffer. What on earth is a poor driver to do?
So which approach does XNA choose? As of version 4.0:
Prior to version 4.0, things worked the same on Windows, but our Xbox implementation was less awesome:
I'm very happy that we finally found time to implement resource renaming on Xbox, so you can SetData any time you like, regardless of whether predicated tiling is in use, and SetDataOptions.Discard works the same as on other platforms.
The Discard flag is a hint to the driver that you no longer care about any of the data in the resource, so it does not need to bother preserving the existing contents. This can make resource renaming more efficient, because it allows the driver to skip the data copy described in the above step 4.b.
You should specify the Discard flag any time you are calling SetData on a dynamic buffer, and are planning on entirely replacing the contents of that buffer. Even if the current SetData call only changes part of the buffer, if you no longer need the data in the rest of the buffer, this is your chance to let the driver know that. It will love you for giving such a useful hint!
Note that Discard only means "the driver is allowed to throw away the current contents of the buffer if it finds that to be useful". The driver does not HAVE to discard the buffer contents if it does not wish to do so! If the buffer is not currently in use by the GPU, the driver will ignore the Discard hint.
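To make the Discard behaviour concrete, here is a minimal toy model of a driver-managed resource. It is only a sketch under stated assumptions: `ToyResource`, `Hint`, `InUseByGpu`, `Renames` and `BytesCopied` are hypothetical names invented for illustration, `Hint` is a stand-in for `SetDataOptions` so the code compiles without XNA, and the model assumes the driver always renames a busy resource rather than stalling. Real drivers are far more sophisticated.

```csharp
using System;

// Stand-in for XNA's SetDataOptions, so this sketch compiles without XNA.
enum Hint { None, Discard, NoOverwrite }

// Toy model of a single driver-managed resource. Not real driver code.
class ToyResource
{
    public byte[] Memory;      // storage the CPU currently writes into
    public bool InUseByGpu;    // true while a queued draw call still reads Memory
    public int Renames;        // how many fresh allocations have been made
    public int BytesCopied;    // how many bytes were preserved across renames

    public ToyResource(int sizeInBytes)
    {
        Memory = new byte[sizeInBytes];
    }

    public void SetData(int offset, byte[] data, Hint hint)
    {
        if (InUseByGpu && hint != Hint.NoOverwrite)
        {
            // Rename: the GPU keeps the old storage, the CPU gets new storage.
            byte[] fresh = new byte[Memory.Length];

            if (hint != Hint.Discard)
            {
                // Without Discard, the previous contents must be preserved.
                Array.Copy(Memory, fresh, Memory.Length);
                BytesCopied += Memory.Length;
            }

            Memory = fresh;
            InUseByGpu = false;
            Renames++;
        }

        // If the GPU is not using the resource, a Discard hint changes nothing:
        // the write simply goes into the existing storage.
        Array.Copy(data, 0, Memory, offset, data.Length);
    }
}
```

In this model, a Discard write to an idle resource performs no rename at all, a Discard write to a busy resource performs a rename with no copy, and a write with no hint to a busy resource pays for a full copy.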
The NoOverwrite flag is a hint to the driver that you are not going to change any part of the resource which the GPU might still be using. This is not enforced, but if you do change data while the GPU is using it, you will get incorrect rendering (typically flickering, but almost anything might happen depending on timing).
You can use the NoOverwrite flag when combining multiple independent pieces of data into a single larger buffer. If you SetData one piece of data into one part of the buffer, then draw using this data, and are now about to SetData a different piece of data into a different part of the buffer, this is your chance to tell the driver that even though you are changing a resource which is still in use by the GPU, you happen to know that the region you are changing is not the same as the region the GPU is using, so there is no need for it to bother stalling or renaming the resource.
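Because the driver does not enforce the NoOverwrite promise, one option is to check it yourself in debug builds. The sketch below is a hypothetical helper, not part of XNA: `PendingRanges`, `MarkDrawn`, `Clear` and `IsSafeForNoOverwrite` are names invented here, and it assumes you record every vertex range you draw from and clear the record whenever you know the GPU can no longer be reading the old data (for example after a Discard).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical debug helper: remembers which vertex ranges have been drawn
// and may therefore still be read by the GPU.
class PendingRanges
{
    private readonly List<(int Start, int Count)> ranges =
        new List<(int Start, int Count)>();

    // Call after issuing a draw that reads vertices [start, start + count).
    public void MarkDrawn(int start, int count)
    {
        ranges.Add((start, count));
    }

    // Call when the old data can no longer be in use, e.g. after a Discard.
    public void Clear()
    {
        ranges.Clear();
    }

    // True if writing [start, start + count) touches no range still pending.
    public bool IsSafeForNoOverwrite(int start, int count)
    {
        foreach (var r in ranges)
        {
            bool overlaps = start < r.Start + r.Count && r.Start < start + count;
            if (overlaps)
                return false;
        }
        return true;
    }
}
```

A caller could assert `IsSafeForNoOverwrite(position, count)` just before each NoOverwrite SetData, which turns a timing-dependent flicker into a deterministic failure.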
Used wisely, the NoOverwrite hint can provide dramatic speed gains. But used incorrectly, it can produce incorrect rendering results. Check out our Particle 3D sample for an example of using it correctly (see the giant comment near the top of the ParticleSystem class), or the way I implemented skidmarks in MotoGP.
A common pattern for games that need to generate lots of dynamic geometry is to use a single dynamic buffer as a circular queue. New geometry is appended to the buffer with NoOverwrite, then drawn, then more geometry is appended again using NoOverwrite, and drawn, rinse, lather, repeat. When you reach the end of the buffer, the position is reset back to the start, and this wrapping SetData switches to Discard mode, which signals the driver to perform a rename and give us a fresh copy of the buffer. This scheme allows any amount of geometry to be efficiently drawn using a single relatively small buffer. The driver will internally allocate however many renamed copies are necessary to avoid stalling.
    const int BufferSize = xxxx;
    DynamicVertexBuffer vb = new DynamicVertexBuffer(device, typeof(VertexType), BufferSize, 0);
    int currentBufferPosition = 0;

    // Add new geometry to the buffer, and return the offset for drawing these vertices.
    int AddVerticesToDynamicBuffer(VertexType[] vertices)
    {
        // Append to the existing buffer.
        int position = currentBufferPosition;
        SetDataOptions hint = SetDataOptions.NoOverwrite;

        // If we reached the end, wrap back to the beginning and Discard the existing buffer contents.
        if (position + vertices.Length > BufferSize)
        {
            position = 0;
            hint = SetDataOptions.Discard;
        }

        // Write the new data into the buffer.
        vb.SetData(position * sizeof(VertexType), vertices, 0, vertices.Length, sizeof(VertexType), hint);

        // Remember where the next batch goes, and tell the caller where this one starts.
        currentBufferPosition = position + vertices.Length;
        return position;
    }
Internally, SpriteBatch does pretty much exactly this.
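To see which hint each append ends up with, the wrap-around bookkeeping of the circular queue can be pulled out into a small class that runs without a graphics device. This is only a sketch: `CircularCursor`, `Reserve` and the `Hint` enum (a stand-in for `SetDataOptions`) are hypothetical names, and the capacity check is an added assumption that a single batch never exceeds the buffer size.

```csharp
using System;

// Stand-in for XNA's SetDataOptions, so this sketch compiles without XNA.
enum Hint { None, Discard, NoOverwrite }

// Tracks the write position in a circular dynamic buffer of `capacity` vertices.
class CircularCursor
{
    private readonly int capacity;
    private int current;

    public CircularCursor(int capacity)
    {
        this.capacity = capacity;
    }

    // Reserves room for `count` vertices. Returns the vertex offset to write
    // and draw from, and reports which hint the write should use.
    public int Reserve(int count, out Hint hint)
    {
        if (count > capacity)
            throw new ArgumentException("Batch is larger than the whole buffer.");

        // Append after the previous batch by default.
        int position = current;
        hint = Hint.NoOverwrite;

        // Not enough room left: wrap to the start and discard the old contents.
        if (position + count > capacity)
        {
            position = 0;
            hint = Hint.Discard;
        }

        current = position + count;
        return position;
    }
}
```

With a capacity of 10, reserving 4, 4 and 4 vertices yields offsets 0, 4 and 0, with the first two writes using NoOverwrite and the third using Discard.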
I am still not quite clear about the difference between these two approaches:
#1 Lock a vertex buffer with DISCARD, put some data in it, DrawPrimitive, and repeat it
#2 after the previous DrawPrimitive, lock the buffer with NOOVERWRITE, put some new data at the end of previous vertex data, then call DrawPrimitive again.
Mind shedding any light on this for me? Thanks
I've been missing the Discard option on my XBox all the time. Does this mean using dynamic vertex buffers is no longer a no-go on the XBox? I wish there was a similar thing for textures so I could SDO.NoOverwrite-lock them in order to use them as a dynamically updated sprite sheet :)
@Nil: Check out the "Performance Optimizations" page in the DirectX docs, it explains the technique Shawn uses:
I'd be interested in exactly /why/ this technique is being recommended myself, though. It might work fast 4 frames in a row, then produce a small "spike" in frame times when the driver has to copy the contents of the vertex buffer. What's that good for? Is the idea that in a full game with many buffers, these spikes evenly distribute over the frames being drawn, yielding a net gain in performance?
Well, I understand the effect of DISCARD and NOOVERWRITE; what I am not clear about is how to choose between them in my situation. Here is my approach:
I batched all my meshes with dynamic vertices together, sorted by render state, and I get one big dynamic vertex buffer to do the rendering. I fill it up with data until I need to change the material or transformation, then I draw it with DrawPrimitive. At this point the vertex buffer is usually NOT full, so there are two choices for me to draw the next primitive.
#1 Lock the buffer with DISCARD, and start to fill new data from the beginning. In my case I lock the whole vertex buffer, cuz I try to lock it once with as much space as possible for incoming data, only unlock before I need to draw.
#2 Lock the rest of buffer with NOOVERWRITE, and fill new data from the end of old data.
I have been wondering what is the difference between these two ways. I mean is there any speed penalty, or waste of precious AGP memory? Oh, this question is mostly based on DirectX, but I am also interested in how it will work on XNA.
This post has shown me some inside information about how driver handles vertex buffer, thanks a lot.
> Does this mean using dynamic vertex buffers is no longer a no-go on the XBox?
> It might work fast 4 frames in a row, then produce a small "spike" in frame times when the driver has to copy the contents of the vertex buffer.
As long as you always use either NoOverwrite or Discard, the driver never has to copy the contents of the buffer. Resource renaming is only expensive if you specify neither option, in which case a data copy is required. Discard renames are just a memory allocation, and good drivers use much cleverness to minimize the expense of such things.
> I have been wondering what is the difference between these two ways. I mean is there any speed penalty, or waste of precious AGP memory?
It depends whether you want the driver to continue using the same buffer it is now, or to perform a rename and allocate a new copy of the buffer. The latter will obviously require more memory, but is necessary if you do not have any unused space (that has never been rendered by the GPU) in the existing copy of the buffer.
Most often, you should use the circular buffer pattern like I described above.
Too many renames are inefficient, and the driver has to reallocate the full size of the buffer even if you only put 4 verts into it. Hence, nooverwrite into a large buffer is more efficient. My goal is to do about 1 discard per frame in general.
Shawn: how does this interact with vertex buffers not having heterogenous formats in 4? Do I need one dynamic vb per vertex declaration?
> Too many renames are inefficient, and the driver has to reallocate the full size of the buffer even if you only put 4 verts into it. Hence, nooverwrite into a large buffer is more efficient. My goal is to do about 1 discard per frame in general.
This is mostly about memory rather than time. Renaming resources is very efficient, so it makes little difference whether you perform many renames on a small buffer, or just a few renames on a larger buffer. The big inefficiency is if you perform many renames on a large buffer (discarding after putting just 4 verts into it) which can waste a lot of memory.
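A rough back-of-envelope calculation shows why the large-buffer case is the wasteful one. The helper below is a hypothetical sketch, and the numbers are made up for illustration: it assumes a worst case where every Discard in a frame gets a fresh full-size allocation and the driver cannot recycle any earlier copy until the frame has finished.

```csharp
using System;

static class RenameCost
{
    // Worst-case bytes allocated in one frame if every Discard renames the
    // whole buffer and no earlier copy can be recycled yet (an assumption).
    public static long BytesAllocatedPerFrame(int vertexCapacity, int strideInBytes, int discardsPerFrame)
    {
        return (long)vertexCapacity * strideInBytes * discardsPerFrame;
    }
}
```

Under those assumptions, 100 discards per frame on a 65,536-vertex buffer with 32-byte vertices is `BytesAllocatedPerFrame(65536, 32, 100)`, or 209,715,200 bytes (200 MB), whereas the same 100 discards on a 256-vertex buffer come to 819,200 bytes (800 KB).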
> how does this interact with vertex buffers not having heterogenous formats in 4? Do I need one dynamic vb per vertex declaration?
"When the CPU reaches the SetData call and wants to change the contents of the vertex buffer (or index buffer, or texture...)"
Does SetData on Texture2D stall the pipeline?
> Does SetData on Texture2D stall the pipeline?
Any time you SetData on a resource that the GPU is still using, the driver will have to pick one of the four options that I described above. Textures are no different from vertex buffers or index buffers in this regard.
On a game I wrote called Organon, I used a vertex texture to feed positions into the vertex shader. I may have alternated between more than one texture, I can't remember. It ran at 1280x720 with no antialiasing and didn't crash or have performance problems, but I worried about that, so if I upgraded it to XNA Framework 4.0 there should now be less likelihood of problems. Thank you.
I'm trying to understand the purpose of the circular queue design pattern that you outlined.
Is this any different from a double buffer? I assume that the purpose of the circular queue is that you can use the NoOverwrite flag for fast writing, and then use a different portion of the data each frame, so that the GPU is never using data that you are writing to.
But isn't that identical to a double buffer, and just swap which buffer you are writing/reading to each frame that the data changes (assuming that it takes no more than 1 frame to finish the setdata call)?
And on that note, why do you need to use the Discard flag with the circular queue pattern? Can't you just wrap around to the front of the queue and keep using NoOverwrite?
I guess I must be misinterpreting how the vertex buffer will be used for drawing.
Double buffering and always using NoOverwrite mode would only work if
a) Your vertex buffer is big enough that you never run out of space within it inside a single frame
b) The GPU is always finished with drawing the previous frame by the time the CPU starts running draw code for the next-but-one frame
But neither of those things is necessarily true...