Posts Tagged optimization
Lightspark gets video streaming
Posted by Alessandro Pignotti in Insane Projects on March 15, 2010
Just a brief news. It’s been a long way, and today I’m very proud to announce video streaming support for Lightspark, the efficient open source flash player. Moreover, performance looks very promising, I’m not going to publish any results right now as I’d like to do some testing before. Anyway, Lightspark seems to be outperforming Adobe’s player by a good extent, at least on linux.
In the next post I’ll talk a bit about some performance tricks that made it possible to reach such result.
The quest for graphics performance: part I
Posted by Alessandro Pignotti in Coding tricks on June 27, 2009
Developing and optimizing Lightspark, the modern Flash player, I’m greatly expanding my knowledge and understanding of GPU internals. In the last few days I’ve managed to find out a couple of nice tricks that boosted up performance and, as nice side effects, saved CPU time and added features
First of all, I’d like to introduce a bit of the Lightspark graphics architecture
The project is designed from the ground up to make use of the features offered by modern graphics hardware. Namely 3D acceleration and programmable shaders. The Flash file format encodes the geometries to be drawn as set of edges. This representation is quite different from the one understood by GPUs. So the geometries are first triangulated (reduced to a set of triangles). This operation is done on the CPU and is quite computationally intensive, but the results are cached, so overall this does not hit performance.
Moreover Flash offer several different fill styles that should be applied on geometry, for example solid color and various kind of gradients. Lightspark handles all those possibilities using a single fragment shader, a little piece of code that is invoked on every pixel to compute the desired color. Of course the shader has to know about the current fill style. This information along with several other parameters could be passed with different methods. More on this on the next issue.
There is one peculiar thing about the shader though, let’s look at a simple pseudo code:
gl_FragColor=solid_color()*selector[0]+linear_gradient()*selector[1]+circular_gradient()*selector[2]...;
Selector is a binary array, the only allowed values are zero or one. Moreover only one value is one. This means that the current fragment (pixel) color is computed for every possible fill style and only afterward the correct result is selected. This may look like a waste of computing power, but it is actually more efficient than something like this:
if(selector[0]) gl_FragColor=solid_color(); else if(selector[1]) gl_FragColot=linear_gradient(); ...
This counter intuitive fact comes from the nature of the graphics hardware. GPUs are very different from CPUs and are capable of cruching tons of vectorial operations blindingly fast. But they totally fall down on their knees when encountering branches in the code. This is actually quite common on deeply pipelined architecture which misses complex branch prediction circuitry, not only GPUs but also number crunching devices and multimedia monsters such as IBM Cell. Keep this in mind when working on such platforms.
ActionScript meets LLVM: part I
Posted by Alessandro Pignotti in Insane Projects on May 17, 2009
One of the major challenges in the design of lightspark is the ActionScript execution engine. Most of the more recent flash content is almost completely build out of the ActionScript technology, which with version 3.0 matured enough to become a foundational block of the current, and probably future web. The same technology is going to become also widespread offline if the Adobe AIR platform succeedes as a cross platform application framework.
But what is ActionScript? Basically it is an almost ECMAscript complaiant language; the specification covers the language itself, a huge library of components and the bytecode format that is used to deliver code to the clients, usually as part of a SWF (flash) file.
The bytecode models a stack-machine as most of the arguments are passed on the stack and not as operands in the code. This operational description — although quite dense — requires a lot of stack traffic, even for simple computations. It should be noted that modern x86/amd64 processors employ specific stack tracing units to optimize out such traffic, but this is highly architecture dependent and not guaranteed.
LLVM (which stands fot Low-Level Virtual Machine) is on the other hand based on an Intermediate Language in SSA form. This means that each symbol can be assigned only one time. This form is extremely useful when doing optimization over the code. LLVM offers a nice interface for a bunch of feature, most notably sophisticated optimization of the code and Just-In-Time compilation to native assemply.
The challenge is: how to exploit llvm power to build a fast ActionScript engine.
The answer is, as usual, a matter of compromises. Quite a lot of common usage patterns of the stack-machine can be heavily optimized with limited work, for example most of the data pushed on the stack is going to be used right away! More details on this on the next issue...
Case Study: Real Time video encoding on Via Epia, Part II
Posted by Alessandro Pignotti in Coding tricks on February 6, 2009
Once upon a time, there was the glorious empire of DOS. It was a mighty and fearful time, when people could still talk to the heart of the machines and write code in the forgotten language of assembly. We are now lucky enough to have powerful compilers that do most of this low level work for us, and hand crafting assembly code is not needed anyomore. But the introduction of SIMD (Single Instruction Multiple Data) within the x86 instruction set made this ancient ability useful again.
MMX/SSE are a very powerful and dangerous beast. We had almost no previous experience with low level assembly programming. And a critical problem to solve: how to convert from RGB colorspace to YUV, and do it fast on our very limited board.
As I’ve wrote on the previous article the conversion is conceptually simple and it’s basically a 3x3 matrix multiplication. That’s it, do 9 scalar products and you’re done!
SIMD instructions operate on packed data: this means that more than one (usually 2/4) value is stored in a single register and operations on them are parallelized. For example you can do four sums with a single operation.
Unfortunately MMX/SSE is a vertical instruction set. This means you can do very little with the data packed in a single register. There are however instructions that do ‘half a scalar product’. We found out an approach to maximize the throughput using this.
Our camera, a Pointgrey Bumblebee, delivers raw sensor data via Firewire, arranged in a pattern called Bayer Encoding. Color data is arranged in 2x2 cells, and there are twice the sensors for green than for the the other colors, since the human eye is more sensible to that color. We at first rearrange the input data in a strange but useful pattern, as in picture. The following assembler code then does the magic, two pixel at a time.
//Loading mm0 with 0, this will be useful to interleave data byte pxor %mm0,%mm0 //Loading 8 bytes from buffer. Assume %eax contains the address of the input buffer //One out of four bytes are zeros, but the overhead is well balanced by the aligned memory access. //Those zeros will also be useful later on movd (%eax),%mm1 // < R1, G1, B2, 0> movd 4(%eax),%mm2 // < B1, 0, R2, G2> //Unpacking bytes to words, MMX registers are 8 bytes wide so we can interleave the data bytes with zeros. punpcklbw %mm0,%mm1 punpcklbw %mm0,%mm2 //We need triple copy of each input, one for each output channel movq %mm1,%mm3 // < R1, G1, B2, 0> movq %mm2,%mm4 // < B1, 0, R2, G2> movq %mm1,%mm5 // < R1, G1, B2, 0> movq %mm2,%mm6 // < B1, 0, R2, G2> //Multiply and accumulate, this does only half the work. //We multiply the data with the right constants and sums the results in pair. //The consts are four packed 16bit values and contains the constants scaled by 32768. //[YUV]const and [YUV]const_inv are the same apart from being arrenged to suit the layout of the even/odd inputs pmaddwd Yconst,%mm1 // < Y1*R1 + Y2*G1, Y3*B2 + 0> pmaddwd Uconst,%mm3 // < U1*R1 + U2*G1, U3*B2 + 0> pmaddwd Vconst,%mm5 // < V1*R1 + V2*G1, V3*B2 + 0> pmaddwd Yconst_inv,%mm2 // < Y3*B1 + 0, Y1*R2 + Y2*G2> pmaddwd Uconst_inv,%mm4 // < U3*B1 + 0, U1*R2 + U2*G2> pmaddwd Vconst_inv,%mm6 // < V3*B1 + 0, V1*R2 + V2*G2> //Add registers in pairs to get the final scalar product. The results are two packed pixel for each output channel and still scaled by 32768 paddd %mm2,%mm1 // < Y1*R1 + Y2*G1 + Y3*B1, Y1*R2, Y2*G2 + Y3*B2> paddd %mm4,%mm3 // < U1*R1 + U2*G1 + U3*B1, U1*R2, U2*G2 + U3*B2> paddd %mm6,%mm5 // < V1*R1 + V2*G1 + V3*B1, V1*R2, V2*G2 + V3*B2> //We shift right by 15 bits to get rid of the scaling psrad $15,%mm1 psrad $15,%mm3 psrad $15,%mm5 //const128 is two packed 32bit values, this is the offset to be added to the U/V channnels //const128: // .long 128 // .long 128 paddd const128,%mm3 paddd const128,%mm5 //We repack the resulting dwords to bytes packssdw %mm0,%mm1 packssdw %mm0,%mm3 packssdw %mm0,%mm5 packuswb %mm0,%mm1 packuswb %mm0,%mm3 packuswb %mm0,%mm5 //We copy the byte pairs to the destination buffers, assume %ebx, %esi and %edi contains the address of such buffers movd %mm1,%ecx movw %cx,(%ebx) movd %mm3,%ecx movb %cl,(%esi) movd %mm5,%ecx movb %cl,(%edi) |
Simple right?
Coding this was difficult but in the end really interesting. And even more important, this was really fast and we had no problem using this during the robot competition itself. Read the rest of this entry »