Posts Tagged SSE

Allocating aligned memory

Just a quick note that may be useful to someone else. As you may know, SSE2 introduced a new instruction: MOVDQA (MOVe Double Quadword Aligned). It is used to move 128 bits (16 bytes) of data between memory and the XMM registers. The instruction only works if the data is aligned to a 16-byte boundary. There is also another instruction for the unaligned case, but the aligned version is way faster. So let's summarize some techniques to get an aligned memory area.

  • For local, static and member variables you can append __attribute__ ((aligned (16))) to the type definition. Example:
  • struct A { int val; } __attribute__ ((aligned (16)));

  • For dynamically allocated memory the usual malloc is not enough, but there is posix_memalign, which has the semantics that we need. It is defined as:
  • int posix_memalign(void **memptr, size_t alignment, size_t size);

So we have to pass a pointer to the pointer that will receive the newly allocated memory, the required alignment (which has to be a power of two) and the allocation size. Memory allocated this way can (at least with the glibc implementation) be freed using the usual free function.
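As a quick illustration, here is a minimal sketch in C (not taken from any particular project) that puts both techniques together:

#define _POSIX_C_SOURCE 200112L
#include <stdio.h>
#include <stdlib.h>

//Static/member case: the whole struct is padded and aligned to a 16 byte boundary
struct A { int val; } __attribute__ ((aligned (16)));

int main(void)
{
    //Dynamic case: ask for a 16 byte aligned buffer of 1024 bytes
    void *buf = NULL;
    if (posix_memalign(&buf, 16, 1024) != 0) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    //buf now sits on a 16 byte boundary, so it can be used with MOVDQA-based code
    printf("allocated at %p\n", buf);

    //On glibc this memory can be released with the usual free()
    free(buf);
    return 0;
}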



Google Chrome bug #9007, should we care?

Much discussion spawned from Google Chrome bug #9007. The problem is actually quite simple: Chrome depends on SSE2 instructions and so, when run on processors which do not support that extension, it crashes crying 'Illegal instruction'. This arcane-looking message simply means: "Come on my friend, go buy a new computer."

To understand my opinion, let me talk a bit about the SSE family of extensions.

MMX/SSE introduced the SIMD computational model to mainstream computing. SIMD means 'Single Instruction Multiple Data': for each instruction the same computation is executed over several independent data elements. These instructions are extremely useful in mathematics and multimedia applications. I don't think SSE can be rightfully called an extension. It's actually a necessary feature which was missing in the early Intel designs. SSE2 was introduced back in 2001, and it's currently supported on every modern processor, from Atoms to bleeding-edge quad-core Xeons. Keep in mind that by using SIMD instructions it is possible to speed up the code by a factor of 4 or 8. And this could make the difference between a realtime and an offline application.

I really think there is no reason for Google developers to waste time and resources to support obsolescent machines. Good code should be efficient in terms of processor time, power consumption and memory allocation. To obtain such goals it is often necessary to exploit new features.

This consideration is also the foundation of the lightspark project design. I'm making use of every feature a modern platform offers, such as heavy multithreading, multimedia extensions and programmable graphics cards. All of this to obtain a software which is fast and lean on resources, even on limited platforms such as Mobile Internet Devices and sub-notebooks.



Case Study: Real Time video encoding on Via Epia, Part II

Once upon a time, there was the glorious empire of DOS. It was a mighty and fearful time, when people could still talk to the heart of the machines and write code in the forgotten language of assembly. We are now lucky enough to have powerful compilers that do most of this low level work for us, and hand crafting assembly code is not needed anymore. But the introduction of SIMD (Single Instruction Multiple Data) within the x86 instruction set made this ancient ability useful again.

MMX/SSE is a very powerful and dangerous beast. We had almost no previous experience with low level assembly programming, and a critical problem to solve: how to convert from the RGB colorspace to YUV, and do it fast on our very limited board.
As I wrote in the previous article, the conversion is conceptually simple: it's basically a 3x3 matrix multiplication. That's it, do 9 scalar products and you're done!


SIMD instructions operate on packed data: more than one value (usually 2 or 4) is stored in a single register and operations on them are parallelized. For example you can do four sums with a single operation.
Unfortunately MMX/SSE is a vertical instruction set. This means you can do very little with the data packed in a single register. There are however instructions that do 'half a scalar product', and we worked out an approach to maximize the throughput using them.
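To make the 'half a scalar product' idea concrete, here is a scalar model (just a sketch, not part of the real code) of what the pmaddwd instruction does on one 64-bit MMX register: four signed 16-bit words in, two signed 32-bit sums out.

//Scalar model of pmaddwd: multiply four pairs of signed 16 bit words
//and add the products two by two, giving two 32 bit results
void pmaddwd_model(const short a[4], const short b[4], int out[2])
{
    out[0] = (int)a[0]*b[0] + (int)a[1]*b[1];
    out[1] = (int)a[2]*b[2] + (int)a[3]*b[3];
}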

Our camera, a Pointgrey Bumblebee, delivers raw sensor data via Firewire, arranged in a pattern called Bayer encoding. Color data is arranged in 2x2 cells, and there are twice as many sensors for green as for the other colors, since the human eye is more sensitive to that color. We first rearrange the input data in a strange but useful pattern, as in the picture; a sketch of that repacking is shown below. The assembler code after it then does the magic, two pixels at a time.
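The picture is not reproduced here, but the layout can be read off the register comments in the assembly: each pair of pixels ends up as the eight bytes R1, G1, B2, 0, B1, 0, R2, G2. Here is a hedged sketch of that repacking step, assuming the two pixels are already available as plain RGB triplets (the real code starts from the Bayer data):

//Repack two RGB pixels into the 8 byte layout expected by the MMX code below
//bytes 0..3 -> R1, G1, B2, 0    bytes 4..7 -> B1, 0, R2, G2
//Illustrative sketch assuming demosaiced RGB input; the real code works on the raw Bayer pattern
void pack_pixel_pair(const unsigned char rgb1[3], const unsigned char rgb2[3],
                     unsigned char out[8])
{
    out[0] = rgb1[0]; //R1
    out[1] = rgb1[1]; //G1
    out[2] = rgb2[2]; //B2
    out[3] = 0;
    out[4] = rgb1[2]; //B1
    out[5] = 0;
    out[6] = rgb2[0]; //R2
    out[7] = rgb2[1]; //G2
}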

//Loading mm0 with 0, this will be useful to interleave the data bytes
pxor %mm0,%mm0
 
//Loading 8 bytes from buffer. Assume %eax contains the address of the input buffer
//One out of four bytes is zero, but the overhead is well balanced by the aligned memory access.
//Those zeros will also be useful later on
movd (%eax),%mm1 // < R1, G1, B2, 0>
movd 4(%eax),%mm2 // < B1, 0, R2, G2>
//Unpacking bytes to words, MMX registers are 8 bytes wide so we can interleave the data bytes with zeros.
punpcklbw %mm0,%mm1
punpcklbw %mm0,%mm2
 
//We need three copies of each input, one for each output channel
movq %mm1,%mm3 // < R1, G1, B2, 0>
movq %mm2,%mm4 // < B1, 0, R2, G2>
movq %mm1,%mm5 // < R1, G1, B2, 0>
movq %mm2,%mm6 // < B1, 0, R2, G2>
 
//Multiply and accumulate, this does only half the work.
//We multiply the data by the right constants and sum the results in pairs.
//The consts are four packed 16bit values and contain the constants scaled by 32768.
//[YUV]const and [YUV]const_inv are the same apart from being arranged to suit the layout of the even/odd inputs
pmaddwd Yconst,%mm1 // < Y1*R1 + Y2*G1, Y3*B2 + 0>
pmaddwd Uconst,%mm3 // < U1*R1 + U2*G1, U3*B2 + 0>
pmaddwd Vconst,%mm5 // < V1*R1 + V2*G1, V3*B2 + 0>
 
pmaddwd Yconst_inv,%mm2 // < Y3*B1 + 0, Y1*R2 + Y2*G2>
pmaddwd Uconst_inv,%mm4 // < U3*B1 + 0, U1*R2 + U2*G2>
pmaddwd Vconst_inv,%mm6 // < V3*B1 + 0, V1*R2 + V2*G2>
 
//Add the registers in pairs to get the final scalar products. The results are two packed pixels for each output channel, still scaled by 32768
paddd %mm2,%mm1 // < Y1*R1 + Y2*G1 + Y3*B1, Y1*R2 + Y2*G2 + Y3*B2>
paddd %mm4,%mm3 // < U1*R1 + U2*G1 + U3*B1, U1*R2 + U2*G2 + U3*B2>
paddd %mm6,%mm5 // < V1*R1 + V2*G1 + V3*B1, V1*R2 + V2*G2 + V3*B2>
 
//We shift right by 15 bits to get rid of the scaling
psrad $15,%mm1
psrad $15,%mm3
psrad $15,%mm5
 
//const128 is two packed 32bit values, this is the offset to be added to the U/V channels
//const128:
// .long 128
// .long 128
paddd const128,%mm3
paddd const128,%mm5
 
//We repack the resulting dwords down to bytes: packssdw packs the dwords to words, then packuswb packs the words to unsigned bytes
packssdw %mm0,%mm1
packssdw %mm0,%mm3
packssdw %mm0,%mm5
 
packuswb %mm0,%mm1
packuswb %mm0,%mm3
packuswb %mm0,%mm5
 
//We copy the byte pairs to the destination buffers; assume %ebx, %esi and %edi contain the addresses of such buffers
movd %mm1,%ecx
movw %cx,(%ebx)
movd %mm3,%ecx
movb %cl,(%esi)
movd %mm5,%ecx
movb %cl,(%edi)

Simple, right? :-)

Coding this was difficult but in the end really interesting. And even more important, it was really fast and we had no problems using it during the robot competition itself.

