Archive for March, 2010
The quest for graphics performance: part II
Posted by Alessandro Pignotti in Coding tricks, Insane Projects on March 16, 2010
I’d like to talk a bit about the architecture I’ve been using to efficiently render the video stream in Lightspark. As often happens in high performance computing, the key is using the right tool for each job. First of all, video decoding and rendering are asynchronous and executed by different threads.
Decoding itself is done by the widely known FFmpeg; no special tricks are played here. So the starting condition of the optimized fast path is a decoded frame data structure. This structure is short lived and will be overwritten by the next decoded frame, so it must be copied to a more stable buffer. The decoding thread maintains a short array of decoded frames ready to be rendered, to account for variance in the decoding delay. The decoded frame is in YUV420 format: the resolution of the color data is half that of the luminance channel in each dimension. FFmpeg returns the data as 3 distinct buffers, one for each of the Y, U and V channels, so we actually save 3 buffers per frame. This copy is necessary, and it’s the only one that will ever be done on the data.
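For illustration, here is a minimal sketch of that copy in C++, assuming FFmpeg’s AVFrame layout; the types and names are hypothetical, not Lightspark’s actual ones. Note that FFmpeg pads each row of a plane to linesize[i], which may be larger than the visible width, so the planes are copied row by row:

```cpp
#include <cstdint>
#include <cstring>
extern "C" {
#include <libavcodec/avcodec.h>
}

// Hypothetical stable buffer for one decoded YUV420 frame.
struct DecodedFrame {
    std::uint8_t *y, *u, *v; // one buffer per channel
    int width, height;       // luma resolution; chroma is half in each dimension
};

// Copy the short-lived AVFrame planes into the stable buffer, row by row.
void copyFrame(const AVFrame* src, DecodedFrame& dst)
{
    const int cw = dst.width / 2, ch = dst.height / 2; // chroma plane size
    for (int r = 0; r < dst.height; r++)
        std::memcpy(dst.y + r * dst.width,
                    src->data[0] + r * src->linesize[0], dst.width);
    for (int r = 0; r < ch; r++) {
        std::memcpy(dst.u + r * cw, src->data[1] + r * src->linesize[1], cw);
        std::memcpy(dst.v + r * cw, src->data[2] + r * src->linesize[2], cw);
    }
}
```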
Rendering is done using a textured quad, and texture data is loaded using OpenGL Pixel Buffer Objects (PBOs). PBOs are memory buffers managed by the GL, and it’s possible to load texture data from them. Unfortunately they must be explicitly mapped into the client address space to be accessed, and unmapped when the update is done. The advantage is that data transfers between PBOs and video or texture memory are done by the GL using asynchronous DMA. Using 2 PBOs it’s possible to guarantee a continuous stream of data to video memory: while one PBO is being copied to texture memory by DMA, the next frame is being computed and transferred to the other one by the CPU. This usage pattern is called streaming textures.
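A rough sketch of the double-buffered pattern, using standard OpenGL 2.1-era calls; fillNextFrame() is a hypothetical stand-in for the packing routine described below, the texture is assumed to be already allocated with glTexImage2D, and error handling is omitted:

```cpp
#include <GL/glew.h>

void fillNextFrame(void* dst); // hypothetical: packs the next decoded frame

GLuint pbo[2];
int cur = 0;
const int W = 640, H = 480;
const size_t frameBytes = W * H * 4; // packed YUV0, 32 bits per pixel

void initPBOs()
{
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; i++) {
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[i]);
        glBufferData(GL_PIXEL_UNPACK_BUFFER, frameBytes, 0, GL_STREAM_DRAW);
    }
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}

void uploadFrame(GLuint tex)
{
    // Kick off the DMA transfer from the PBO filled on the previous pass;
    // the last argument is an offset into the bound PBO, not a pointer.
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[cur]);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H,
                    GL_RGBA, GL_UNSIGNED_BYTE, 0);

    // ...and meanwhile let the CPU pack the next frame into the other PBO.
    cur = 1 - cur;
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[cur]);
    void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    fillNextFrame(dst);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}
```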
The data in question is the next frame, taken from the decoded frames buffer. Texture data for OpenGL must be provided in packed (interleaved) form, so we must pack the 1-buffer-per-channel frame into a single buffer. This can be done without any intermediate copies using instructions provided by the SSE2 extension. Data is loaded in 128-bit chunks from each of the Y, U and V channels; then, using register-only operations, it is packed and padded into the right layout. Results are written back using non-temporal moves: the processor is free to postpone the actual commitment of the data to memory, for example to exploit burst transfers on the bus. If we ever need to be sure that the changes have been committed to memory, we have to issue the sfence instruction. For more information see the Intel reference manuals on movapd, movntpd, sfence and punpcklbw.
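A minimal sketch of what such a packing routine can look like with SSE2 intrinsics (the compiler-level spelling of the instructions above). It converts one luma row, plus the matching chroma row, from planar YUV420 to packed YUV0, and assumes the width is a multiple of 16 and the destination is 16-byte aligned; it is an illustration, not Lightspark’s actual code:

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

void packRowYUV0(const std::uint8_t* Y, const std::uint8_t* U,
                 const std::uint8_t* V, std::uint8_t* out, int width)
{
    const __m128i zero = _mm_setzero_si128();
    for (int x = 0; x < width; x += 16) {
        __m128i y = _mm_loadu_si128((const __m128i*)(Y + x));
        // 8 chroma samples cover 16 luma pixels: duplicate each horizontally
        __m128i u = _mm_loadl_epi64((const __m128i*)(U + x / 2));
        __m128i v = _mm_loadl_epi64((const __m128i*)(V + x / 2));
        u = _mm_unpacklo_epi8(u, u); // U0,U0,U1,U1,...
        v = _mm_unpacklo_epi8(v, v);

        // Interleave into Y,U,V,0 byte quadruples (punpcklbw and friends)
        __m128i yu_lo = _mm_unpacklo_epi8(y, u);
        __m128i yu_hi = _mm_unpackhi_epi8(y, u);
        __m128i vz_lo = _mm_unpacklo_epi8(v, zero);
        __m128i vz_hi = _mm_unpackhi_epi8(v, zero);

        // Non-temporal stores: bypass the cache, commit to memory lazily
        std::uint8_t* p = out + x * 4;
        _mm_stream_si128((__m128i*)(p +  0), _mm_unpacklo_epi16(yu_lo, vz_lo));
        _mm_stream_si128((__m128i*)(p + 16), _mm_unpackhi_epi16(yu_lo, vz_lo));
        _mm_stream_si128((__m128i*)(p + 32), _mm_unpacklo_epi16(yu_hi, vz_hi));
        _mm_stream_si128((__m128i*)(p + 48), _mm_unpackhi_epi16(yu_hi, vz_hi));
    }
    _mm_sfence(); // make sure the streamed data is globally visible
}
```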
The result is a single buffer in YUV0 format; the padding byte is added to increase texture transfer efficiency, as video cards internally work with 32-bit data anyway. The destination buffer is one of the PBOs, so at the end of the conversion routine the data will be transferred to video memory using DMA.
Using the streaming texture technique and SSE2 data packing we manage to move the frame data to texture memory efficiently, but it’s still in YUV format. Conversion to the RGB color space is basically a linear algebra operation (a matrix multiplication plus an offset), so it’s ideal to offload this computation to a pixel shader.
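As an illustration, the conversion shader could look like the following sketch (here a GLSL fragment shader kept as a C++ string constant), using the common BT.601 coefficients for video-range YUV; the actual shader used by Lightspark may differ:

```cpp
// The texture holds packed YUV0 uploaded via the PBOs, so the R, G and B
// channels actually carry Y, U and V respectively.
const char* yuv2rgbShader = R"(
    uniform sampler2D tex;
    void main()
    {
        vec3 yuv = texture2D(tex, gl_TexCoord[0].st).rgb;
        float y = yuv.r - 0.0625; // 16/256 offset for video-range luma
        float u = yuv.g - 0.5;
        float v = yuv.b - 0.5;
        // The color-space conversion itself is one matrix multiply
        gl_FragColor = vec4(1.164 * y + 1.596 * v,
                            1.164 * y - 0.391 * u - 0.813 * v,
                            1.164 * y + 2.018 * u,
                            1.0);
    }
)";
```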
Lightspark gets video streaming
Posted by Alessandro Pignotti in Insane Projects on March 15, 2010
Just a brief news item. It’s been a long way, and today I’m very proud to announce video streaming support for Lightspark, the efficient open source Flash player. Moreover, performance looks very promising. I’m not going to publish any results right now, as I’d like to do some more testing first; anyway, Lightspark seems to be outperforming Adobe’s player by a good margin, at least on Linux.
In the next post I’ll talk a bit about some of the performance tricks that made it possible to reach this result.