This document is to complement the original Expression Encoder 4 SP1 GPU Encoding white paper and discuss the changes that SP2 introduced on this subject. The latest service pack includes a major revision of the CUDA H.264 encoder as well as the introduction of Intel Sandy Bridge QSV support, which both will be discussed in depth below.
Please note that an updated performance report for Cuda GPU Encoding as well as a new one for QSV GPU Encoding are available.
To Maximize Performance
As before, we recommend applying the proper GPU for video processing load. We provide below basic guidelines for optimizing your performance using a GPU to encode. Using proper GPU hardware and a conservative "GPU Stream" setting can reduce the chances of maxing out the GPU resources. Another way to reduce the GPU memory footprint is by lowering the "Number of Reference Frames". GPU-Z is a great freeware to monitor the load and memory usage.
To calculate the appropriate number of GPU streams setting for a specific scenario, we also created a simple performance tool. More information is available here.
Finally, since available GPU memory is more important than ever, we recommend closing any other applications to minimize GPU memory usage outside of Encoder.
To Maximize Quality
In collaboration with our partners at Main Concept, a lot of work was done on the CUDA H.264 encoder to optimize the performance of the GPU, as well as improving output quality and adding new encode functionality like true CBR and HRD support. Those changes have significantly impacted its behavior, especially in terms of performance:
In other words, you will experience better quality and less CPU load, at the cost of higher GPU load and some overall performance loss.
Finally, because the CUDA encoder doesn't support 2-pass encodings, they have been disabled. Please note that while Encoder will revert to software encoding if a 2-pass encode is selected, the new 1-pass H.264 VBR encoding options are fully supported in both CUDA and QSV-based encoding.
Because of the new behavior, we now recommend using only higher-end CUDA GPUs for this feature. A rough suggestion would be to get CUDA hardware with a minimum of 200 CUDA cores per HD stream to be encoded via CUDA and at the least 1 GB of video memory (2 GB or more preferred). It means that for encoding a single HD stream on the GPU, at the least a GeForce GTX 550 Ti or Quadro 2000 GPU is recommended, while encoding 2 or more HD streams on the GPU would require at the least a GeForce GTX 570, Quadro 6000 or Tesla C2050 card. As in Encoder SP1, multiple video cards can be used to speed up Smooth Streaming encodes. Note that we also changed our default of number of GPU streams to 2, which can be tweaked to take full advantage of the hardware used.
Since the GPU is taking more of the load, matching the CPU may not be as important for single stream encodes. Because most current CUDA GPUs can't take more that 2-3 HD streams at a time, a powerful CPU is still required for HD Smooth Streaming encodes.
Finally, make sure to install the latest Nvidia drivers available from Nvidia's website here. Note that the GPU driver version is required to be higher than 267.91 for SP2.
What is QSV?
Intel Quick Sync Video (QSV) is a GPU acceleration technology available on some Intel processors that can dramatically increase the speed of transcoding media files. It is available on most 2nd generation i3, i5 and i7 processors (aka "Sandy Bridge"), as well as some newer Xeon CPUs, like the E3-1285. More information about QSV is available on Intel's website. Please note that the newly Sandy Bridge E CPU line (example: i7-3960X) does not have an embedded GPU, and thus does not support QSV.
QSV encoding support has been integrated in Encoder 4 Pro SP2, which can be used in a similar fashion as CUDA encoding to speed up MP4 and H.264-based Smooth Streaming encode operations.
QSV GPU encoding works the same way than CUDA encoding as explained in the Encoder SP1 GPU Encoding white paper. Because the two GPU-based H.264 encoders behave very differently, Expression Encoder 4 Pro SP2 provides a way to control the number of streams independently between CUDA and QSV, which are defaulting to 2 and 8 respectively (see below). As with CUDA, QSV encoding is turned on by default in the Encoder application and turned off in the SDK.
As for the SDK, the QSV device is enumerated in the H264EncodeDevices.Devices along with the CUDA devices. A new H264EncodeDevice property, DeviceType of type H264DeviceType, provides a way to differentiate between the two GPU types. H264EncodeDevices.MaxNumberGPUStream keeps controlling the number of streams to be encoded by CUDA, while H264EncodeDevices.MaxNumberIntelHDGraphicsStream controls the streams encoded by QSV.
Multi GPU Encoding
Both CUDA and QSV encoding can be used together, either on one multi-stream encode or multiple instances of encoder, which may provide even better performance results. While encoding multiple streams on a PC with both technologies, the streams will be allocated in a snake pattern starting with the Intel GPU (1-2-2-1, 1-2-3-3-2-1) from the highest resolution stream down.
It's worth mentioning that, when used with a discrete GPU, Intel QSV requires the Intel GPU to be enabled as default video adapter in the BIOS and needs to be active and connected to a monitor. Note that LucidLogix Virtu technology, bundled with some Sandy Bridge PCs, enables virtualization of the discrete GPU through the Intel GPU, enabling the usage of both QSV and Cuda and using only one output. More information is available on LucidLogix' website.
Like the CUDA implementation, the QSV supports only a subset of encode settings. For instance, those encoding features are not supported:
QSV encoding is not supported on XP.
An active Intel HD Graphics GPU is required to be accessible. This means that:
As with CUDA, QSV encoding is an approximation of what the software-only based encoding can provide. Output quality differences, especially at low bitrates should be expected. Because of this, we recommend using the software H.264 encoder if highest quality results are a priority over higher performance.
The integrated Intel Accelerator 3000 GPU, found in high-end mobile CPUs and in the K line of 2nd generation i5 and i7 based desktops as well as on some Xeon and i3 CPUs, is highly recommended. See the Intel website for more details.
Because memory is shared, we highly recommend having a minimum of 4 GB for single stream encoding and 8 GB or more for Smooth Streaming and/or parallel encoding. Running out of system memory will greatly affect performance.
It is highly recommended to install the very latest drivers directly from the Intel website to ensure best stability and reliability.
While the QSV encoding can take a large load of concurrent streams, it is worth noting that a blend of hardware and software encoding will likely result in better performance in most Smooth Streaming cases.
A few issues were found in the recent sets of Intel drivers currently available. While all of them have been investigated and resolved, some of the fixes didn't make it into the latest drivers, but will be available in the next driver version (ETA early 2012). Here is the short list:
OpenCL and DirectCompute technologies were evaluated but deemed not ready for integration in Expression Encoder 4 Pro SP2. We are planning to continue monitoring them and investigating potential integration in the future.