Posts Tagged SSE
Allocating aligned memory
Posted by Alessandro Pignotti in Coding tricks on October 29, 2009
Just a quick note that may be useful to someone else. As you may know, SSE2 introduced a new instruction: MOVDQA (MOVe Double Quadword Aligned). It moves 128 bits (16 bytes) of data between memory and XMM registers, and it only works if the data is aligned to a 16-byte boundary. There is also an instruction for the unaligned case (MOVDQU), but the aligned version is way faster. So let's summarize some techniques to get an aligned memory area.
- For local, static and member variables you can append __attribute__ ((aligned (16))) to the type definition. Example:
struct A { int val; } __attribute__ ((aligned (16)));
- For dynamically allocated memory the usual malloc is not enough, but POSIX provides posix_memalign, which has the semantics that we need. It is defined as:
int posix_memalign(void **memptr, size_t alignment, size_t size);
So we have to pass a pointer to the pointer that will receive the newly allocated memory, the required alignment (which has to be a power of two) and the allocation size. Memory allocated this way can (at least on the glibc implementation) be freed using the usual free function.
Case Study: Real Time video encoding on Via Epia, Part II
Posted by Alessandro Pignotti in Coding tricks on February 6, 2009
Once upon a time, there was the glorious empire of DOS. It was a mighty and fearful time, when people could still talk to the heart of the machines and write code in the forgotten language of assembly. We are now lucky enough to have powerful compilers that do most of this low-level work for us, and hand-crafting assembly code is not needed anymore. But the introduction of SIMD (Single Instruction, Multiple Data) extensions to the x86 instruction set made this ancient ability useful again.
MMX/SSE is a very powerful and dangerous beast. We had almost no previous experience with low-level assembly programming, and a critical problem to solve: how to convert from the RGB colorspace to YUV, and do it fast on our very limited board.
As I wrote in the previous article, the conversion is conceptually simple: it's basically a 3x3 matrix multiplication. That's it, do 9 scalar products and you're done!
SIMD instructions operate on packed data: more than one value (usually 2 or 4) is stored in a single register, and operations on them are parallelized. For example, you can do four sums with a single operation.
Unfortunately, MMX/SSE is a vertical instruction set. This means you can do very little between the values packed in a single register. There are, however, instructions that do 'half a scalar product'. We worked out an approach that uses them to maximize throughput.
Our camera, a Point Grey Bumblebee, delivers raw sensor data via FireWire, arranged in a pattern called Bayer encoding. Color data is arranged in 2x2 cells, with twice as many sensors for green as for the other colors, since the human eye is more sensitive to that color. We first rearrange the input data into a strange but useful pattern, as in the picture. The following assembly code then does the magic, two pixels at a time.
//Load mm0 with 0; this will be useful to interleave the data bytes
pxor %mm0,%mm0
//Load 8 bytes from the buffer. Assume %eax contains the address of the input buffer.
//One byte out of four is zero, but the overhead is well balanced by the aligned memory access.
//Those zeros will also be useful later on.
movd (%eax),%mm1        // <R1, G1, B2, 0>
movd 4(%eax),%mm2       // <B1, 0, R2, G2>
//Unpack bytes to words: MMX registers are 8 bytes wide, so we can interleave the data bytes with zeros.
punpcklbw %mm0,%mm1
punpcklbw %mm0,%mm2
//We need three copies of each input, one for each output channel
movq %mm1,%mm3          // <R1, G1, B2, 0>
movq %mm2,%mm4          // <B1, 0, R2, G2>
movq %mm1,%mm5          // <R1, G1, B2, 0>
movq %mm2,%mm6          // <B1, 0, R2, G2>
//Multiply and accumulate; this does only half the work.
//We multiply the data by the right constants and sum the results in pairs.
//The constants are four packed 16-bit values, scaled by 32768.
//[YUV]const and [YUV]const_inv are the same apart from being arranged to suit the layout of the even/odd inputs.
pmaddwd Yconst,%mm1     // <Y1*R1 + Y2*G1, Y3*B2 + 0>
pmaddwd Uconst,%mm3     // <U1*R1 + U2*G1, U3*B2 + 0>
pmaddwd Vconst,%mm5     // <V1*R1 + V2*G1, V3*B2 + 0>
pmaddwd Yconst_inv,%mm2 // <Y3*B1 + 0, Y1*R2 + Y2*G2>
pmaddwd Uconst_inv,%mm4 // <U3*B1 + 0, U1*R2 + U2*G2>
pmaddwd Vconst_inv,%mm6 // <V3*B1 + 0, V1*R2 + V2*G2>
//Add registers in pairs to get the final scalar products.
//The results are two packed pixels per output channel, still scaled by 32768.
paddd %mm2,%mm1         // <Y1*R1 + Y2*G1 + Y3*B1, Y1*R2 + Y2*G2 + Y3*B2>
paddd %mm4,%mm3         // <U1*R1 + U2*G1 + U3*B1, U1*R2 + U2*G2 + U3*B2>
paddd %mm6,%mm5         // <V1*R1 + V2*G1 + V3*B1, V1*R2 + V2*G2 + V3*B2>
//Shift right by 15 bits to get rid of the scaling
psrad $15,%mm1
psrad $15,%mm3
psrad $15,%mm5
//const128 is two packed 32-bit values; this is the offset to be added to the U/V channels
//const128:
//  .long 128
//  .long 128
paddd const128,%mm3
paddd const128,%mm5
//Repack the resulting dwords to bytes
packssdw %mm0,%mm1
packssdw %mm0,%mm3
packssdw %mm0,%mm5
packuswb %mm0,%mm1
packuswb %mm0,%mm3
packuswb %mm0,%mm5
//Copy the byte pairs to the destination buffers; assume %ebx, %esi and %edi contain the addresses of those buffers
movd %mm1,%ecx
movw %cx,(%ebx)
movd %mm3,%ecx
movb %cl,(%esi)
movd %mm5,%ecx
movb %cl,(%edi)
Simple, right?
Coding this was difficult but, in the end, really interesting. Even more important, it was really fast, and we had no problem using it during the robot competition itself.