Posts Tagged optimization

Lightspark gets video streaming

Just a quick news item. It's been a long road, and today I'm very proud to announce video streaming support for Lightspark, the efficient open source Flash player. Moreover, performance looks very promising. I'm not going to publish any results right now, as I'd like to do some more testing first, but Lightspark seems to be outperforming Adobe's player by a good margin, at least on Linux.

In the next post I'll talk a bit about the performance tricks that made it possible to reach this result.



The quest for graphics performance: part I

[Image: Lightspark tech demo 2]

While developing and optimizing Lightspark, the modern Flash player, I'm greatly expanding my knowledge and understanding of GPU internals. In the last few days I've managed to find a couple of nice tricks that boosted performance and, as nice side effects, saved CPU time and added features :-)

First of all, I'd like to introduce a bit of the Lightspark graphics architecture.

The project is designed from the ground up to make use of the features offered by modern graphics hardware, namely 3D acceleration and programmable shaders. The Flash file format encodes the geometries to be drawn as sets of edges. This representation is quite different from the one understood by GPUs, so the geometries are first triangulated (reduced to a set of triangles). This operation is done on the CPU and is quite computationally intensive, but the results are cached, so overall it does not hurt performance.
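
To make the caching step concrete, here is a minimal C++ sketch of the idea; the class, the shape-id key and the fan triangulation are my own simplifications for illustration, not Lightspark's actual code (a real Flash tessellator has to cope with arbitrary, possibly concave outlines with holes).

#include <map>
#include <vector>

struct Vertex { float x, y; };
struct Triangle { Vertex a, b, c; };

//Toy tessellator: a simple triangle fan, enough for convex outlines only
std::vector<Triangle> triangulate(const std::vector<Vertex>& outline)
{
	std::vector<Triangle> ret;
	for(size_t i=1;i+1<outline.size();i++)
	{
		Triangle t={outline[0],outline[i],outline[i+1]};
		ret.push_back(t);
	}
	return ret;
}

class GeometryCache
{
private:
	//Cache keyed by the shape id found in the SWF file (an assumption for this sketch)
	std::map<int, std::vector<Triangle> > cache;
public:
	const std::vector<Triangle>& get(int shapeId, const std::vector<Vertex>& outline)
	{
		std::map<int, std::vector<Triangle> >::iterator it=cache.find(shapeId);
		if(it==cache.end()) //Triangulate only the first time a shape is seen
			it=cache.insert(std::make_pair(shapeId,triangulate(outline))).first;
		return it->second;
	}
};

The important property is that triangulate() runs only on the first request for a shape; every later frame just reuses the cached triangle list.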

Moreover, Flash offers several different fill styles that can be applied to a geometry, for example solid color and various kinds of gradients. Lightspark handles all those possibilities using a single fragment shader, a little piece of code that is invoked on every pixel to compute the desired color. Of course the shader has to know about the current fill style. This information, along with several other parameters, could be passed in different ways. More on this in the next post.
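
Just to give a taste of one such method (a plain uniform array set through standard OpenGL 2.0 calls; whether Lightspark really uses uniforms, textures or vertex attributes is exactly what that post is about), here is a hypothetical sketch. The "selector" name and the limit of 8 styles are assumptions made for illustration.

#include <GL/glew.h> //any GL 2.0 function loader will do; GLEW is just an example

//Tell the fragment shader which fill style is active by setting the selector array
void setFillStyle(GLuint program, int styleIndex)
{
	GLfloat selector[8]={0.0f}; //all elements zero...
	selector[styleIndex]=1.0f;  //...except the active style
	glUseProgram(program);
	GLint loc=glGetUniformLocation(program,"selector");
	glUniform1fv(loc,8,selector);
}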

There is one peculiar thing about the shader, though. Let's look at some simple pseudo-code:
gl_FragColor = solid_color()*selector[0] + linear_gradient()*selector[1]
             + circular_gradient()*selector[2] + ...;

selector is a binary array: each element is either zero or one, and exactly one element is set to one. This means that the color of the current fragment (pixel) is computed for every possible fill style, and only afterwards is the correct result selected. This may look like a waste of computing power, but it is actually more efficient than something like this:

if(selector[0])
       gl_FragColor=solid_color();
else if(selector[1])
       gl_FragColor=linear_gradient();
...

This counterintuitive fact comes from the nature of graphics hardware. GPUs are very different from CPUs: they are capable of crunching tons of vectorial operations blindingly fast, but they fall to their knees when they encounter branches in the code. This is actually quite common on deeply pipelined architectures that lack complex branch prediction circuitry, not only GPUs but also number crunching devices and multimedia monsters such as the IBM Cell. Keep this in mind when working on such platforms.



ActionScript meets LLVM: part I

One of the major challenges in the design of Lightspark is the ActionScript execution engine. Most recent Flash content is almost completely built on ActionScript, which with version 3.0 has matured enough to become a foundational block of the current, and probably the future, web. The same technology may also become widespread offline if the Adobe AIR platform succeeds as a cross-platform application framework.

But what is ActionScript? Basically it is an almost ECMAScript-compliant language; the specification covers the language itself, a huge library of components and the bytecode format that is used to deliver code to clients, usually as part of a SWF (Flash) file.

The bytecode models a stack machine: most of the arguments are passed on the stack and not as operands in the code. This operational description, although quite dense, requires a lot of stack traffic, even for simple computations. It should be noted that modern x86/amd64 processors employ dedicated stack engines to mitigate the cost of such traffic, but this is highly architecture dependent and not guaranteed.
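
To see how the stack traffic adds up, here is a toy C++ stack machine evaluating a*b + c; the opcodes are invented for illustration and are not the real ActionScript bytecodes, but the push/pop pattern is the same.

#include <vector>
#include <iostream>

enum Op { PUSH_A, PUSH_B, PUSH_C, MUL, ADD };

int run(const std::vector<Op>& code, int a, int b, int c)
{
	std::vector<int> stack;
	for(size_t i=0;i<code.size();i++)
	{
		switch(code[i])
		{
			case PUSH_A: stack.push_back(a); break;
			case PUSH_B: stack.push_back(b); break;
			case PUSH_C: stack.push_back(c); break;
			case MUL:
			case ADD:
			{
				int rhs=stack.back(); stack.pop_back();
				int lhs=stack.back(); stack.pop_back();
				stack.push_back(code[i]==MUL ? lhs*rhs : lhs+rhs);
				break;
			}
		}
	}
	//Five pushes and four pops through memory, just for a*b + c
	return stack.back();
}

int main()
{
	Op program[]={ PUSH_A, PUSH_B, MUL, PUSH_C, ADD };
	std::cout << run(std::vector<Op>(program,program+5), 2, 3, 4) << std::endl; //prints 10
}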

LLVM (which stands for Low Level Virtual Machine), on the other hand, is based on an intermediate representation in SSA form. This means that each value is assigned exactly once, a property that is extremely useful when optimizing code. LLVM offers a nice interface to a bunch of features, most notably sophisticated code optimization and Just-In-Time compilation to native assembly.

The challenge is: how to exploit LLVM's power to build a fast ActionScript engine?

The answer is, as usual, a matter of compromises. Quite a lot of common usage patterns of the stack machine can be heavily optimized with limited work; for example, most of the data pushed on the stack is going to be used right away! More details on this in the next post...
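
As a teaser, here is a sketch (my own illustration, not Lightspark's code) of the classic trick that exploits this pattern: keep a compile-time "virtual stack" of already-generated values, so that a push followed by an immediate use becomes a single SSA instruction and no runtime stack traffic at all.

#include <vector>
#include <string>
#include <iostream>

//Stand-in for an SSA value; a real JIT would use something like llvm::Value*
struct Value { std::string name; };

Value* emitAdd(Value* a, Value* b)
{
	//In the real engine this would be a call into the LLVM IR builder; here we just log it
	Value* v=new Value();
	v->name="t0";
	std::cout << v->name << " = add " << a->name << ", " << b->name << std::endl;
	return v;
}

int main()
{
	std::vector<Value*> vstack; //compile-time stack of SSA values
	Value x; x.name="x";
	Value y; y.name="y";

	//Translating the bytecode sequence: push x, push y, add
	vstack.push_back(&x); //the "push" costs nothing at runtime
	vstack.push_back(&y);
	Value* rhs=vstack.back(); vstack.pop_back();
	Value* lhs=vstack.back(); vstack.pop_back();
	vstack.push_back(emitAdd(lhs,rhs)); //one SSA instruction, no stack traffic
	return 0;
}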



Case Study: Real Time video encoding on Via Epia, Part II

Once upon a time, there was the glorious empire of DOS. It was a mighty and fearful time, when people could still talk to the heart of the machines and write code in the forgotten language of assembly. We are now lucky enough to have powerful compilers that do most of this low level work for us, and hand crafting assembly code is not needed anymore. But the introduction of SIMD (Single Instruction Multiple Data) extensions to the x86 instruction set made this ancient ability useful again.

MMX/SSE is a very powerful and dangerous beast. We had almost no previous experience with low level assembly programming, and a critical problem to solve: how to convert from the RGB colorspace to YUV, and do it fast on our very limited board.
As I wrote in the previous article, the conversion is conceptually simple: it's basically a 3x3 matrix multiplication. That's it, three dot products (nine multiplications in total) per pixel and you're done!
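
For reference, this is a scalar version of the conversion, in the same 32768-based fixed point the MMX code below uses; the coefficients are the usual BT.601 (JPEG-style) ones and may differ slightly from the exact constants we used back then.

#include <stdint.h>

static inline uint8_t clamp(int v)
{
	return v<0 ? 0 : (v>255 ? 255 : (uint8_t)v);
}

//One pixel: three dot products, nine multiplications, constants scaled by 32768 (1<<15)
void rgb_to_yuv(uint8_t r, uint8_t g, uint8_t b, uint8_t* y, uint8_t* u, uint8_t* v)
{
	*y=clamp(( 9798*r + 19235*g +  3736*b) >> 15);
	*u=clamp(((-5529*r - 10855*g + 16384*b) >> 15) + 128);
	*v=clamp(((16384*r - 13720*g -  2664*b) >> 15) + 128);
}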

[Figure: rearranged input pixel layout]

SIMD instructions operate on packed data: more than one value (usually two or four) is stored in a single register and operations on them are parallelized. For example, you can do four sums with a single instruction.
Unfortunately, MMX/SSE is a mostly vertical instruction set: you can do very little between the values packed inside a single register. There are, however, instructions that compute 'half a scalar product', and we found an approach that maximizes throughput using them.

Our camera, a Point Grey Bumblebee, delivers raw sensor data over FireWire, arranged in a pattern called Bayer encoding. Color data is arranged in 2x2 cells, with twice as many sensors for green as for the other colors, since the human eye is more sensitive to that color. We first rearrange the input data into a strange but useful pattern, as shown in the figure above (and spelled out in the sketch below). The assembly code that follows then does the magic, two pixels at a time.
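
Since the original figure is not reproduced here, this small helper spells out my reading of the layout implied by the register comments in the listing: two pixels, already in RGB form, are packed into 8 bytes, with padding zeros where the assembly expects them.

#include <stdint.h>

//Resulting buffer, byte by byte: R1 G1 B2 0 B1 0 R2 G2
void pack_two_pixels(uint8_t r1, uint8_t g1, uint8_t b1,
                     uint8_t r2, uint8_t g2, uint8_t b2, uint8_t* out)
{
	out[0]=r1; out[1]=g1; out[2]=b2; out[3]=0;
	out[4]=b1; out[5]=0;  out[6]=r2; out[7]=g2;
}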

//Loading mm0 with 0, this will be useful to interleave the data bytes
pxor %mm0,%mm0
 
//Loading 8 bytes from buffer. Assume %eax contains the address of the input buffer
//One out of every four bytes is zero, but the overhead is well balanced by the aligned memory access.
//Those zeros will also be useful later on
movd (%eax),%mm1 // < R1, G1, B2, 0>
movd 4(%eax),%mm2 // < B1, 0, R2, G2>
//Unpacking bytes to words, MMX registers are 8 bytes wide so we can interleave the data bytes with zeros.
punpcklbw %mm0,%mm1
punpcklbw %mm0,%mm2
 
//We need three copies of each input, one for each output channel
movq %mm1,%mm3 // < R1, G1, B2, 0>
movq %mm2,%mm4 // < B1, 0, R2, G2>
movq %mm1,%mm5 // < R1, G1, B2, 0>
movq %mm2,%mm6 // < B1, 0, R2, G2>
 
//Multiply and accumulate; this does only half the work.
//We multiply the data by the right constants and sum the results in pairs.
//Each constant operand holds four packed 16-bit values, scaled by 32768.
//[YUV]const and [YUV]const_inv are the same apart from being arranged to suit the layout of the even/odd inputs
pmaddwd Yconst,%mm1 // < Y1*R1 + Y2*G1, Y3*B2 + 0>
pmaddwd Uconst,%mm3 // < U1*R1 + U2*G1, U3*B2 + 0>
pmaddwd Vconst,%mm5 // < V1*R1 + V2*G1, V3*B2 + 0>
 
pmaddwd Yconst_inv,%mm2 // < Y3*B1 + 0, Y1*R2 + Y2*G2>
pmaddwd Uconst_inv,%mm4 // < U3*B1 + 0, U1*R2 + U2*G2>
pmaddwd Vconst_inv,%mm6 // < V3*B1 + 0, V1*R2 + V2*G2>
 
//Add the registers in pairs to get the final scalar products. The results are two packed pixels for each output channel, still scaled by 32768
paddd %mm2,%mm1 // < Y1*R1 + Y2*G1 + Y3*B1, Y1*R2 + Y2*G2 + Y3*B2>
paddd %mm4,%mm3 // < U1*R1 + U2*G1 + U3*B1, U1*R2 + U2*G2 + U3*B2>
paddd %mm6,%mm5 // < V1*R1 + V2*G1 + V3*B1, V1*R2 + V2*G2 + V3*B2>
 
//We shift right by 15 bits to get rid of the scaling
psrad $15,%mm1
psrad $15,%mm3
psrad $15,%mm5
 
//const128 holds two packed 32bit values; this is the offset to be added to the U/V channels
//const128:
// .long 128
// .long 128
paddd const128,%mm3
paddd const128,%mm5
 
//We repack the resulting dwords to words, then saturate them to unsigned bytes
packssdw %mm0,%mm1
packssdw %mm0,%mm3
packssdw %mm0,%mm5
 
packuswb %mm0,%mm1
packuswb %mm0,%mm3
packuswb %mm0,%mm5
 
//We copy the results to the destination buffers (two Y bytes, one U and one V byte); assume %ebx, %esi and %edi contain the addresses of those buffers
movd %mm1,%ecx
movw %cx,(%ebx)
movd %mm3,%ecx
movb %cl,(%esi)
movd %mm5,%ecx
movb %cl,(%edi)

Simple, right? :-)

Coding this was difficult but in the end really interesting. And even more importantly, it was really fast, and we had no problems using it during the robot competition itself.

