First of all, this session was sponsored and presented by ATI, so what is mentioned here is obviously geared towards their hardware, but the general information can be beneficial on other cards as well. I'll keep the information here in the form of key points, as this was essentially how the presentation went anyway.

Main points:
- All cards are still bandwidth limited. Move work towards math, as math throughput speeds up faster than memory with each new generation.
- GPUs are moving from straight-line chips toward more agile chips (for Longhorn), but right now chips like big batches.
- Games should now aim high with high-end targets: 1600x1200, 85Hz, 6xAA.
- Where is the bottleneck in your game? Keep the variability of systems in mind when determining where the bottlenecks are.
- What do you need to tune for?
   * Caches (vertex fetch, vertex, texture, Z). All are independent of each other, and writing to one never invalidates another.
   * Vertex/Pixel shaders
   * Z-Buffer
   * Frame buffer

PRE-VS Cache:
- Pure memory cache (32-byte lines), accessible to all vertex fetches. Data size alignment is a HUGE win! A 40-byte vertex is much worse than a 64-byte one!!
- Divided amongst streams.  10%+ hit for using streams. Not too bad for sequential access.
- Compressing vertex data can help you fit within cache lines.
- High-end hardware generally has more than enough vertex processing power. Doing work there is almost free!
- HLSL is your best approach!! Don't hand-optimize assembly.
- Most pipes do one op per clock per pipe (sometimes you'll even get two because of the 5D processor). Mask out unused channels so a vector op and a scalar op can run at the same time.
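The alignment and compression points can be sketched with a few vertex layouts. This is only an illustration: the struct names, the attribute set, and the packed formats are hypothetical, and the 32-byte line size is the only figure taken from the session. The helpers just do offset arithmetic on a tightly packed vertex buffer; they do not model the real fetch hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 32;  // line size quoted in the session

// Hypothetical 40-byte layout: everything stored as full floats.
struct VertexLoose {
    float pos[3];     // 12 bytes
    float normal[3];  // 12 bytes
    float uv0[2];     //  8 bytes
    float uv1[2];     //  8 bytes
};
static_assert(sizeof(VertexLoose) == 40, "expected a 40-byte vertex");

// Same data padded to 64 bytes so every vertex starts on a line boundary.
struct VertexPadded {
    float pos[3];
    float normal[3];
    float uv0[2];
    float uv1[2];
    std::uint8_t pad[24];  // explicit padding up to 2 cache lines
};
static_assert(sizeof(VertexPadded) == 64, "expected a 64-byte vertex");

// Hypothetical compressed layout: normal and UVs quantized to small integers
// (the vertex shader would have to scale them back), fitting in one line.
struct VertexPacked {
    float pos[3];           // 12 bytes
    std::int8_t normal[4];  //  4 bytes, w unused
    std::int16_t uv0[2];    //  4 bytes
    std::int16_t uv1[2];    //  4 bytes
    std::uint8_t pad[8];    //  8 bytes of padding up to 32
};
static_assert(sizeof(VertexPacked) == 32, "expected a 32-byte vertex");

// How many of the first `count` vertices start exactly on a cache line.
std::size_t alignedStarts(std::size_t stride, std::size_t count) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if ((stride * i) % kCacheLine == 0) ++n;
    }
    return n;
}

// How many cache lines the vertex at `index` overlaps.
std::size_t linesTouched(std::size_t stride, std::size_t index) {
    std::size_t first = (stride * index) / kCacheLine;
    std::size_t last = (stride * index + stride - 1) / kCacheLine;
    return last - first + 1;
}
```

With these helpers, only every fourth 40-byte vertex starts on a line boundary, while every 64-byte or 32-byte vertex does, and the packed layout touches one line per vertex instead of two. Note that a 40-byte vertex and a 64-byte vertex both overlap two lines, so the penalty reported in the session presumably comes from how the hardware handles unaligned strides, which this arithmetic does not capture.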

Post-VS Cache:
- Gives you free triangles when using indexed primitives, as re-used vertices will not go through the shader twice.
- FIFO cache means triangle indices must have coherency.
- Use D3DXOptimizeMesh, which asks the driver how best to optimize the mesh.
- Cache is flushed between DrawPrim calls.
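To see why index coherency matters for a FIFO cache, here is a minimal simulation. The function name and the cache size used below are assumptions for illustration; the session did not give an actual size. The cache starts empty on each call, which mirrors the flush between draw calls.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Counts how many times the vertex shader would run for one draw call,
// assuming a FIFO post-VS cache that holds `cacheSize` transformed vertices.
std::size_t countShaderRuns(const std::vector<std::uint32_t>& indices,
                            std::size_t cacheSize) {
    std::deque<std::uint32_t> fifo;  // empty at start: flushed per draw call
    std::size_t runs = 0;
    for (std::uint32_t idx : indices) {
        if (std::find(fifo.begin(), fifo.end(), idx) != fifo.end()) {
            continue;  // hit: vertex is reused, and FIFO order is unchanged
        }
        ++runs;  // miss: the vertex goes through the shader
        fifo.push_back(idx);
        if (fifo.size() > cacheSize) {
            fifo.pop_front();  // evict the oldest entry
        }
    }
    return runs;
}
```

With a hypothetical 4-entry cache, the index list {0,1,2, 2,1,3, 3,4,5, 6,7,8} costs 9 shader runs, while the same four triangles ordered as {0,1,2, 3,4,5, 6,7,8, 2,1,3} cost 12, because the shared vertices have already been evicted by the time they are needed again.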

Z-Buffer:
- Still draw front to back (this avoids useless work, as the hardware will early-reject masked pixels).
- For really complex scenes do a Z pass first. In other words, fill the Z buffer first with color writes disabled and then render the scene. The hardware can go twice as fast when color writes are turned off.
- Clear the Z and stencil every frame. But use the clear function and not a manual fill to take advantage of Z-Buffer compression.
- ATI can speed up Z when color writes are off and AA is enabled.
- oDepth, changing the Z compare mode, and alpha test all mess up the depth buffer compression.
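Drawing front to back usually comes down to sorting opaque objects by distance from the camera before submitting them. The sketch below assumes a hypothetical `DrawItem` record with a bounding-volume center; a real engine would sort whatever per-object data it already has.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Hypothetical per-object record: an id plus the center of its bounds.
struct DrawItem {
    int id;
    Vec3 center;
};

// Squared distance is enough for ordering and avoids a square root.
float distanceSquared(const Vec3& a, const Vec3& b) {
    float dx = a.x - b.x;
    float dy = a.y - b.y;
    float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Orders opaque items nearest-first so later, farther pixels can be
// rejected by the Z test before shading.
void sortFrontToBack(std::vector<DrawItem>& items, const Vec3& camera) {
    std::sort(items.begin(), items.end(),
              [&camera](const DrawItem& a, const DrawItem& b) {
                  return distanceSquared(a.center, camera) <
                         distanceSquared(b.center, camera);
              });
}
```

Sorting by object center is only approximate for large or overlapping objects, which is one reason a separate Z pass can still pay off in really complex scenes.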

Pixel Shader:
- Blending has a cost as it forces the hardware to read from the frame buffer. Turn it off when not needed.
- ddx/ddy (gradient instructions) operate on a 2x2 pixel basis so essentially low resolution.
- Texture cache is relatively small (8-32k) and is partitioned over all active textures. It is wrecked by random access (i.e. try to avoid things such as really noisy bumpmaps). It can sometimes be better to do multi-pass than a multi-texture shader, as it keeps the number of simultaneous textures lower.
- Also note that the cache contains uncompressed textures.
- Better at rejecting pixels. Keep as much work as possible out of pixel shaders. Most apps are pixel limited.
- A short shader is better than a long one.
- On the high end there is about 4x more ALU power than texture power, and the gap will increase. So you can usually have 4+ math instructions per texture fetch without additional cost.
- Trilinear and aniso cost 2-4 times more as far as texture fetches go.
- Processor is only 4D, so no scalar benefits. But still mask out unused channels so it can run ops in parallel.
- Early-allocated surfaces are faster than later ones, especially for frame and Z-buffers.
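The "low resolution" remark about ddx/ddy can be illustrated on the CPU. The sketch below assumes the gradient is a simple difference between the two pixels in the same row (or column) of a 2x2 quad; whether a given chip differences per row or once per quad was not stated in the session, and the function names are made up for this example. Coordinates are assumed to be non-negative.

```cpp
#include <cassert>
#include <functional>

// Value of some shader quantity at integer pixel coordinates.
using PixelFn = std::function<float(int, int)>;

// Horizontal gradient as seen by the pixel at (x, y): the difference between
// the right and left pixels of the 2x2 quad that contains it.
float quadDdx(const PixelFn& f, int x, int y) {
    int left = x & ~1;  // left column of the quad containing x
    return f(left + 1, y) - f(left, y);
}

// Vertical gradient: the difference between the bottom and top pixels of
// the quad, taken in this pixel's column.
float quadDdy(const PixelFn& f, int x, int y) {
    int top = y & ~1;  // top row of the quad containing y
    return f(x, top + 1) - f(x, top);
}
```

For a quantity like x*x, pixels 0 and 1 both get a gradient of 1, and pixels 2 and 3 both get 5, so the result changes only once per quad rather than once per pixel.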