Posts Tagged assembly
Lightspark News: Progress on stability, Codenames and Logo Poll
Posted by Alessandro Pignotti in Projects on May 23, 2010
First of all, thanks a lot to all those brave enough to try out this project. I’m sorry about all the (frequent) crashes but, with the help of all the people who filed bugs on Launchpad, the stability of Lightspark is improving very fast. Please keep testing and reporting any issues. The next big release, 0.4.0 codenamed “Aeolus”, is planned for the first week of June. The focus for this release is the stability of the platform, and no major features are being implemented. The release is also going to include a brand new logo! The call for logos in the previous post generated a lot of very nice work, and it was very hard to choose between the entries. In the end I managed to narrow it down to two, and now it’s your turn! Vote for the one you prefer.
Besides aesthetics, I’m also trying to define a bit of a roadmap for the project. While the next release is focused only on stability, for the following one (0.5.0, codenamed “Bacchus”) I’m planning to restore working YouTube support, which was lost after one of the updates to the video player.
I’ve also received a lot of questions and interest about porting Lightspark to other OSes and architectures. The code is built using standard technologies, such as pthreads and the STL, and should be quite portable, but some critical code paths have been written in assembly to guarantee atomicity or improve performance. I have very little experience with anything besides x86/x86-64, so I prefer not to port such critical code myself. However, I will gladly accept contributions for other platforms, such as PPC and ARM. The good news is that a contributor managed to compile Lightspark on FreeBSD/x86 with minimal changes to the build system, and a Windows port is also planned. Moreover, besides the Ubuntu PPA I’m maintaining, packages are being created for Arch Linux and Debian; thanks a lot to the community.
ActionScript meets LLVM: part I
Posted by Alessandro Pignotti in Insane Projects on May 17, 2009
One of the major challenges in the design of Lightspark is the ActionScript execution engine. Most recent Flash content is built almost entirely on ActionScript, which with version 3.0 matured enough to become a foundational block of the current, and probably future, web. The same technology is also going to become widespread offline if the Adobe AIR platform succeeds as a cross-platform application framework.
But what is ActionScript? Basically, it is an almost ECMAScript-compliant language; the specification covers the language itself, a huge library of components, and the bytecode format that is used to deliver code to clients, usually as part of a SWF (Flash) file.
The bytecode models a stack machine: most of the arguments are passed on the stack and not as operands in the code. This operational description, although quite dense, requires a lot of stack traffic, even for simple computations. It should be noted that modern x86/amd64 processors employ dedicated stack engines to optimize away such traffic, but this is highly architecture dependent and not guaranteed.
LLVM (which stands for Low Level Virtual Machine) is, on the other hand, based on an intermediate language in SSA (Static Single Assignment) form. This means that each symbol can be assigned exactly once. This form is extremely useful when optimizing code. LLVM offers a nice interface for a bunch of features, most notably sophisticated code optimization and Just-In-Time compilation to native assembly.
The challenge is: how do we exploit LLVM’s power to build a fast ActionScript engine?
The answer is, as usual, a matter of compromises. Quite a lot of common stack-machine usage patterns can be heavily optimized with limited work; for example, most of the data pushed on the stack is going to be used right away! More details on this in the next issue...
Case Study: Real Time video encoding on Via Epia, Part II
Posted by Alessandro Pignotti in Coding tricks on February 6, 2009
Once upon a time, there was the glorious empire of DOS. It was a mighty and fearful time, when people could still talk to the heart of the machines and write code in the forgotten language of assembly. We are now lucky enough to have powerful compilers that do most of this low-level work for us, and hand-crafting assembly code is not needed anymore. But the introduction of SIMD (Single Instruction Multiple Data) within the x86 instruction set made this ancient ability useful again.
MMX/SSE is a very powerful and dangerous beast. We had almost no previous experience with low-level assembly programming, and a critical problem to solve: how to convert from the RGB colorspace to YUV, and do it fast on our very limited board.
As I wrote in the previous article, the conversion is conceptually simple: it’s basically a 3x3 matrix multiplication. That’s it, do three scalar products (nine multiplications) per pixel and you’re done!
SIMD instructions operate on packed data: more than one value (usually two or four) is stored in a single register, and operations on them are parallelized. For example, you can do four sums with a single instruction.
Unfortunately, MMX/SSE is a vertical instruction set. This means you can do very little across the data packed in a single register. There are, however, instructions that do ‘half a scalar product’. We worked out an approach that maximizes throughput using them.
Our camera, a Pointgrey Bumblebee, delivers raw sensor data via Firewire, arranged in a pattern called Bayer encoding. Color data is arranged in 2x2 cells, with twice as many sensors for green as for the other colors, since the human eye is more sensitive to that color. We first rearrange the input data in a strange but useful pattern, as in the picture. The following assembly code then does the magic, two pixels at a time.
//Load mm0 with 0; this will be useful to interleave the data bytes
pxor %mm0,%mm0
//Load 8 bytes from the buffer. Assume %eax contains the address of the input buffer.
//One byte out of four is zero, but the overhead is well balanced by the aligned memory access.
//Those zeros will also be useful later on.
movd (%eax),%mm1        // <R1, G1, B2, 0>
movd 4(%eax),%mm2       // <B1, 0, R2, G2>
//Unpack bytes to words; MMX registers are 8 bytes wide, so we can interleave the data bytes with zeros.
punpcklbw %mm0,%mm1
punpcklbw %mm0,%mm2
//We need three copies of each input, one for each output channel
movq %mm1,%mm3          // <R1, G1, B2, 0>
movq %mm2,%mm4          // <B1, 0, R2, G2>
movq %mm1,%mm5          // <R1, G1, B2, 0>
movq %mm2,%mm6          // <B1, 0, R2, G2>
//Multiply and accumulate; this does only half the work.
//We multiply the data by the right constants and sum the results in pairs.
//The consts are four packed 16-bit values and contain the coefficients scaled by 32768.
//[YUV]const and [YUV]const_inv are the same, apart from being arranged to suit the layout of the even/odd inputs.
pmaddwd Yconst,%mm1     // <Y1*R1 + Y2*G1, Y3*B2 + 0>
pmaddwd Uconst,%mm3     // <U1*R1 + U2*G1, U3*B2 + 0>
pmaddwd Vconst,%mm5     // <V1*R1 + V2*G1, V3*B2 + 0>
pmaddwd Yconst_inv,%mm2 // <Y3*B1 + 0, Y1*R2 + Y2*G2>
pmaddwd Uconst_inv,%mm4 // <U3*B1 + 0, U1*R2 + U2*G2>
pmaddwd Vconst_inv,%mm6 // <V3*B1 + 0, V1*R2 + V2*G2>
//Add registers in pairs to get the final scalar products.
//The results are two packed pixels for each output channel, still scaled by 32768.
paddd %mm2,%mm1         // <Y1*R1 + Y2*G1 + Y3*B1, Y1*R2 + Y2*G2 + Y3*B2>
paddd %mm4,%mm3         // <U1*R1 + U2*G1 + U3*B1, U1*R2 + U2*G2 + U3*B2>
paddd %mm6,%mm5         // <V1*R1 + V2*G1 + V3*B1, V1*R2 + V2*G2 + V3*B2>
//Shift right by 15 bits to get rid of the scaling
psrad $15,%mm1
psrad $15,%mm3
psrad $15,%mm5
//const128 is two packed 32-bit values, the offset to be added to the U/V channels
//const128:
//  .long 128
//  .long 128
paddd const128,%mm3
paddd const128,%mm5
//Repack the resulting dwords down to bytes: dwords to words, then words to bytes
packssdw %mm0,%mm1
packssdw %mm0,%mm3
packssdw %mm0,%mm5
packuswb %mm0,%mm1
packuswb %mm0,%mm3
packuswb %mm0,%mm5
//Copy the results to the destination buffers; assume %ebx, %esi and %edi contain the addresses of those buffers
movd %mm1,%ecx
movw %cx,(%ebx)
movd %mm3,%ecx
movb %cl,(%esi)
movd %mm5,%ecx
movb %cl,(%edi)
Simple, right?
Coding this was difficult but in the end really interesting. And even more importantly, it was really fast, and we had no problems using it during the robot competition itself.
Case Study: Real Time video encoding on Via Epia, Part I
Posted by Alessandro Pignotti in Coding tricks on February 2, 2009
During the pESApod project we worked on the telecommunication and telemetry system for the robot. The computing infrastructure was very complex (maybe too complex): we had three Altera FPGAs on board and a very low-power PC, a VIA Epia board. Using devices that are light on power is a must for mobile robots, but we ended up using more power for the electronics than for the motors. I guess the Altera boards are very heavy on power, being prototyping devices.
Anyway, the Epia with its onboard Eden processor is a very nice machine. It is fully x86 compatible, and we managed to run Linux on it without problems. It does have a very low power footprint, but the performance tradeoff for this was quite heavy. The original plan was to have four video streams from the robot: a pair of proximity cameras for sample gathering and a stereocam for navigation and environment mapping. In the end we used only the stereocam, but even encoding just those two video streams on the Epia was really difficult.
We used libFAME for the encoding. The name means Fast Assembly MPEG Encoder. It is fast indeed, but it is also very poorly maintained, so we had some problems at first making it work. The library accepts frames in YUV format, but our camera sensor data was Bayer encoded, so we had to write the format conversion routine ourselves.
The conversion from RGB color space to YUV is quite simple and can be done using linear algebra. Our first approach was really naive and based on floating point.
// RGB* rgb;
// YUV* yuv;
yuv[i].y = 0.299*rgb[i].r + 0.587*rgb[i].g + 0.114*rgb[i].b;
yuv[i].u = 128 - 0.168736*rgb[i].r - 0.331264*rgb[i].g + 0.5*rgb[i].b;
yuv[i].v = 128 + 0.5*rgb[i].r - 0.418688*rgb[i].g - 0.081312*rgb[i].b;
This was really slow. We later discovered to our disappointment that the FPU was clocked at half the speed of the processor. So we changed the implementation to integer math. The result was something like this:
yuv[i].y = (299*rgb[i].r + 587*rgb[i].g + 114*rgb[i].b)/1000;
yuv[i].u = 128 + (-169*rgb[i].r - 331*rgb[i].g + 500*rgb[i].b)/1000;
yuv[i].v = 128 + (500*rgb[i].r - 419*rgb[i].g - 81*rgb[i].b)/1000;
This solution almost doubled the framerate. But it was still not enough, and we had to dive deep into the magic world of MMX/SSE instructions. The details in the next issue.