Archive for March, 2010

The quest for graphics performance: part II

I’d like to talk a bit about the architecture I’ve been using to efficiently render the video stream in Lightspark. As often happens in high performance computing, the key is using the right tools for each job. First of all, video decoding and rendering are asynchronous and executed by different threads.

Decoding itself is done by the widely known FFmpeg; no special tricks are played here. So the starting condition of the optimized fast path is a decoded frame data structure. This structure is short-lived and is overwritten by the next decoded frame, so its contents must be copied to a more stable buffer. The decoding thread maintains a short array of decoded frames ready to be rendered, to account for variance in the decoding delay. The decoded frame is in YUV420 format, which means the resolution of the color data is one half of the resolution of the luminance channel in each dimension. The data is returned by FFmpeg as 3 distinct buffers, one for each of the YUV channels, so we actually store 3 buffers per frame. This copy is necessary, and it is the only one that will ever be done on the data.
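
A minimal sketch of such a frame buffer might look like the following (the names are hypothetical, not Lightspark’s actual ones, and FFmpeg’s per-plane stride handling is omitted for brevity):

    #include <cstdint>
    #include <vector>

    // Hypothetical sketch: one stable copy of a decoded YUV420 frame.
    // The U and V planes are half the luma resolution in each
    // dimension, i.e. a quarter of the samples.
    struct DecodedFrame
    {
        uint32_t width, height;
        std::vector<uint8_t> y, u, v; // one buffer per channel

        void copyFromPlanes(const uint8_t* srcY, const uint8_t* srcU,
                            const uint8_t* srcV, uint32_t w, uint32_t h)
        {
            width = w; height = h;
            y.assign(srcY, srcY + w * h);
            u.assign(srcU, srcU + (w / 2) * (h / 2));
            v.assign(srcV, srcV + (w / 2) * (h / 2));
        }
    };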

Rendering is done using a textured quad, and texture data is loaded using OpenGL Pixel Buffer Objects (PBOs). PBOs are memory buffers managed by the GL, and it is possible to load texture data from them. Unfortunately they must be explicitly mapped into the client address space to be accessed, and unmapped when the update is complete. The advantage is that data transfers between PBOs and video or texture memory are done by the GL using asynchronous DMA. Using 2 PBOs it is possible to guarantee a continuous stream of data to video memory: while one PBO is being copied to texture memory by DMA, new data is being computed and transferred to the other by the CPU. This usage pattern is called streaming textures.
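
A rough sketch of the double-buffered pattern follows. It assumes a valid GL context and an RGBA texture of the right size already exist; fillNextFrame and the other names are illustrative, not Lightspark’s real ones:

    #include <GL/glew.h> // buffer-object entry points
    #include <cstdint>

    GLuint pbos[2];       // the two streaming PBOs
    GLsizeiptr frameSize; // frame size in bytes (width * height * 4)
    int cur = 0;

    // Hypothetical routine that packs the next decoded frame into dst.
    void fillNextFrame(uint8_t* dst);

    void uploadFrame(GLuint texture, int width, int height)
    {
        int next = 1 - cur;

        // Start the DMA transfer from the current PBO to the texture.
        glBindTexture(GL_TEXTURE_2D, texture);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbos[cur]);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                        GL_RGBA, GL_UNSIGNED_BYTE, 0); // offset in the PBO

        // Meanwhile, fill the other PBO with the next frame on the CPU.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbos[next]);
        // Orphan the old storage so mapping does not stall on the GPU.
        glBufferData(GL_PIXEL_UNPACK_BUFFER, frameSize, 0, GL_STREAM_DRAW);
        uint8_t* ptr = (uint8_t*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER,
                                             GL_WRITE_ONLY);
        if (ptr)
        {
            fillNextFrame(ptr);
            glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
        }
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

        cur = next;
    }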

In this case such data is the next frame, taken from the decoded frames buffer. Texture data for OpenGL must be provided in packed form, so we must pack the one-buffer-per-channel frame into a single buffer. This can be done without any intermediate copy using instructions provided by the SSE2 extension. Data is loaded in 128-bit chunks from each of the Y, U and V channels, then, using register-only operations, it is correctly packed and padded. The results are written back using non-temporal moves: this means the processor is free to postpone the actual commitment of the data to memory, for example to exploit burst transfers on the bus. If we want to be sure that the changes have been committed to memory we have to issue the sfence instruction. For more information see the Intel reference manuals on movapd, movntpd, sfence and punpcklbw.
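
This is roughly what the inner loop looks like, written with compiler intrinsics instead of raw assembly. It is a simplified sketch: it assumes the chroma planes have already been upsampled to full resolution and that all pointers are 16-byte aligned, while the real routine also has to deal with the 2x2 chroma subsampling and per-plane strides.

    #include <emmintrin.h> // SSE2 intrinsics
    #include <cstddef>
    #include <cstdint>

    // Simplified sketch: pack planar Y, U, V data into interleaved
    // YUV0 pixels with non-temporal stores. Assumes full-resolution
    // chroma, 16-byte aligned pointers and a pixel count that is a
    // multiple of 16.
    void packYUV0(const uint8_t* y, const uint8_t* u, const uint8_t* v,
                  uint8_t* dst, size_t pixels)
    {
        const __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i < pixels; i += 16)
        {
            __m128i Y = _mm_load_si128((const __m128i*)(y + i));
            __m128i U = _mm_load_si128((const __m128i*)(u + i));
            __m128i V = _mm_load_si128((const __m128i*)(v + i));

            // Interleave bytes: Y0 U0 Y1 U1 ... and V0 00 V1 00 ...
            __m128i yu_lo = _mm_unpacklo_epi8(Y, U);
            __m128i yu_hi = _mm_unpackhi_epi8(Y, U);
            __m128i v0_lo = _mm_unpacklo_epi8(V, zero);
            __m128i v0_hi = _mm_unpackhi_epi8(V, zero);

            // Interleave 16-bit pairs into 32-bit YUV0 pixels and
            // write them back with non-temporal moves.
            __m128i* out = (__m128i*)(dst + i * 4);
            _mm_stream_si128(out + 0, _mm_unpacklo_epi16(yu_lo, v0_lo));
            _mm_stream_si128(out + 1, _mm_unpackhi_epi16(yu_lo, v0_lo));
            _mm_stream_si128(out + 2, _mm_unpacklo_epi16(yu_hi, v0_hi));
            _mm_stream_si128(out + 3, _mm_unpackhi_epi16(yu_hi, v0_hi));
        }
        _mm_sfence(); // make sure the streamed stores are committed
    }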

The result is a single buffer in YUV0 format: a padding byte is added to each pixel to increase texture transfer efficiency, as video cards internally work with 32-bit data anyway. The destination buffer is one of the PBOs, so at the end of the conversion routine the data will be transferred to video memory using DMA.

Using the streaming texture technique and the SSE2 data packing we managed to move the frame data to texture memory efficiently, but it is still in the YUV color space. Conversion to the RGB color space is basically a linear algebra operation, a matrix multiplication plus an offset per pixel, so it is ideal to offload this computation to a pixel shader.
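
A minimal fragment shader for this conversion might look like the following, embedded here as a C++ string. The BT.601 video-range coefficients and the uniform name are assumptions for the sketch; the exact values used by Lightspark may differ.

    // Minimal sketch of a YUV -> RGB fragment shader using BT.601
    // video-range coefficients (an assumption; other standards differ).
    const char* yuvFragmentShader =
        "uniform sampler2D yuvTex;\n"
        "void main()\n"
        "{\n"
        // Sample the packed YUV0 texture and remove the video-range
        // offsets: 16/256 on luma, 128/256 on the chroma channels.
        "    vec3 yuv = texture2D(yuvTex, gl_TexCoord[0].st).rgb;\n"
        "    yuv -= vec3(0.0625, 0.5, 0.5);\n"
        // Column-major matrix: R, G, B as linear combinations of Y, U, V.
        "    mat3 conv = mat3(1.164,  1.164, 1.164,\n"
        "                     0.000, -0.391, 2.018,\n"
        "                     1.596, -0.813, 0.000);\n"
        "    gl_FragColor = vec4(conv * yuv, 1.0);\n"
        "}\n";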

Lightspark gets video streaming

Just a brief news item. It’s been a long way, and today I’m very proud to announce video streaming support for Lightspark, the efficient open source Flash player. Moreover, performance looks very promising. I’m not going to publish any results right now, as I’d like to do some more testing first. Anyway, Lightspark seems to be outperforming Adobe’s player by a good margin, at least on Linux.

In the next post I’ll talk a bit about the performance tricks that made it possible to reach this result.
