Instances and indices and vfetch - oh my!

I'm cross-posting this discussion from an internal Microsoft mailing list, because I'm so awesomely cool that I just can't bear the thought of everything I ever wrote not being indexed and archived for posterity :-)

 

My reply to a question about hardware instancing on Xbox 360:

The 360 doesn’t support vertex stream frequency in the way that DX9 SM 3.0 does. It just provides the vfetch instruction, which you can use to implement all kinds of crazy addressing schemes.

I’m familiar with several good ways to implement instancing using vfetch:

  1. The technique used in our sample
    • Replicate and offset your index data
    • Use index%freq to index into the vertex buffer
    • Use index/freq to select the instance transform from a shader constant array

  2. Using multiple vertex streams
    • Replicate and offset your index data
    • Use index%freq to index into vertex buffer #1
    • Use index/freq to select the instance transform from vertex buffer #2
    • Upside: no longer limited by shader constant registers
    • Downside: instance data now needs to be set into a dynamic VB, so you have to deal with the complexity of managing that to avoid stalls (which can be a pain since Xbox doesn’t support the Discard semantic for SetData)

  3. Without index replication
    • Draw geometry using a non-indexed API call, so the GPU just generates steadily incrementing index values
    • Store your real index values in vertex stream #1
    • Store vertex data in stream #2
    • Store instance data in either constant registers or stream #3
    • Upside: no longer need to replicate any geometry data at all (thus saves memory)
    • Downside: disables post-T&L vertex caching (thus increases vertex processing workload)

  4. Store instance data in a texture
    • This could be combined with any of the above schemes
    • Great for animations: you can encode all the frames for all the bones of a skinned animation, plus the current position of each instance, into a single texture
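
To make the index arithmetic in schemes 1 and 3 concrete, here is a minimal CPU-side sketch in Python (not shader code). It assumes `freq` is the mesh's vertex count in scheme 1 and its index count in scheme 3; the function names and the quad data are made up for illustration.

```python
def replicate_indices(indices, vertex_count, instance_count):
    """Scheme 1: copy the index list once per instance, offsetting each copy."""
    return [i + inst * vertex_count
            for inst in range(instance_count)
            for i in indices]


def decode_replicated(index, vertex_count):
    """Scheme 1 shader math: returns (vertex, instance) for a replicated index."""
    return index % vertex_count, index // vertex_count


def decode_non_indexed(n, indices):
    """Scheme 3 shader math: n is the counter from a non-indexed draw, and the
    real index is looked up in a stream (here, a plain list)."""
    return indices[n % len(indices)], n // len(indices)


# A quad (4 vertices, 6 indices) drawn 3 times.
quad_indices = [0, 1, 2, 2, 1, 3]
replicated = replicate_indices(quad_indices, vertex_count=4, instance_count=3)

scheme1 = [decode_replicated(i, 4) for i in replicated]
scheme3 = [decode_non_indexed(n, quad_indices) for n in range(6 * 3)]
```

Both lists hold the same (vertex, instance) pairs. Scheme 1 has to store all 18 replicated indices, while scheme 3 keeps only the original 6 but hands the GPU 18 distinct counter values, which is why the post-transform cache can't help.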

 

Chris Tector suggested a cunning fifth option:

There is 2a: indirect your transform indices. Store a transform index vertex buffer that holds one DWORD per instance, saying which transform that instance uses. Then you can avoid the lock stalls by playing dirty and never locking. You write a modified transform to an unused location in the transform vertex buffer, then rewrite the index to point to the newly written transform. You’re relying on atomic updates of the single DWORD transform index. So:

  • Replicate and offset your index data
  • Use index%freq to index into vertex buffer #1
  • Use index/freq to select the instance transform index from vertex buffer #2
  • Use instance transform index to select a transform from vertex buffer #3
  • Upside: no longer limited by shader constant registers and no longer stall prone, with less buffer juggling required
  • Downside: extra vertex fetch indirection, but same values fetched for every vertex so they should stay nice and warm in the vertex cache
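
The update order in 2a can be sketched on the CPU like this. It's a Python simulation rather than real vertex buffer code: the class, its method names and the spare-slot queue are hypothetical, and it assumes the transform buffer has more slots than there are instances. It also recycles old slots without checking whether the GPU has finished reading them, which a real implementation would have to deal with.

```python
from collections import deque


class IndirectTransforms:
    """Stand-in for vertex buffers #2 (transform indices) and #3 (transforms)."""

    def __init__(self, instance_count, spare_slots, identity):
        total = instance_count + spare_slots
        self.pool = [identity] * total             # "vertex buffer #3"
        self.index = list(range(instance_count))   # "vertex buffer #2"
        self.free = deque(range(instance_count, total))

    def update(self, instance, transform):
        slot = self.free.popleft()
        self.pool[slot] = transform      # 1. write the whole transform to an unused slot
        old = self.index[instance]
        self.index[instance] = slot      # 2. publish it by rewriting one index
        self.free.append(old)            # reuse the old slot as late as possible

    def lookup(self, index, vertex_count):
        """What the shader would fetch for one replicated index."""
        instance = index // vertex_count
        return self.pool[self.index[instance]]


transforms = IndirectTransforms(instance_count=2, spare_slots=2, identity="I")
transforms.update(1, "T1")
```

Because the transform is fully written before the index changes, a reader sees either the old transform or the new one, never a half-written mix.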

Since I haven’t tried it in GS, my question is a more general “loose” multi-threading one. Is this possible? Can I play dirty like that in safe-only managed land? I’m guessing not, since I never get a pointer to the VB memory.

 

To which I replied:

That should work in GS. You don’t get a raw pointer to the VB memory, but you can use SetData with the NoOverwrite semantic to update pieces of a dynamic VB without a stall.

  • I seem to remember that NoOverwrite isn't supported on the 360 either. Perhaps I am wrong.

    How many frames of buffering would be needed on the 360 (with or without vsync)? I'd imagine it would be fewer than the 2-5 on the PC.

    I'm currently in the process of implementing vfetch myself - and I am on the fence between #2 and a variation of your '2a' technique as well.

    And I'll just say that when you're not using Effects, dealing with Xbox shaders is a royal pain.

    :-)

  • Some more feedback:

    I've found that using fmod(index,freq) is much more reliable than using index%freq.

  • @StatusUnkown:

    When using index % freq, you can add 0.5 to your index first to avoid rounding errors in the modulus calculation. E.g.:

    int vertexIndex = (index + 0.5) % vertexCount;
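
A small Python sketch of why the 0.5 bias helps, assuming the index reaches the shader as a float that can land slightly below the intended integer (the `noisy` value below is a made-up example of that, and the helper names are hypothetical):

```python
import math


def vertex_index(index, vertex_count, bias=0.0):
    """Float modulus followed by truncation to int."""
    return int(math.fmod(index + bias, vertex_count))


def instance_index(index, vertex_count, bias=0.0):
    """Float division followed by truncation to int."""
    return int((index + bias) / vertex_count)


vertex_count = 3
noisy = 6 - 1e-6   # meant to be index 6, i.e. vertex 0 of instance 2

unbiased = (vertex_index(noisy, vertex_count),
            instance_index(noisy, vertex_count))
biased = (vertex_index(noisy, vertex_count, bias=0.5),
          instance_index(noisy, vertex_count, bias=0.5))
```

Without the bias, the tiny error pushes the result back to the last vertex of the previous instance. Adding 0.5 moves the value to the middle of its integer bucket, so truncation recovers the intended vertex and instance.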
