Archive for February, 2009

Samba upgrade headache

Even a Debian machine can become a source of problems when it is not nursed by the loving hands of a system administrator for a long time. I found myself upgrading Samba from version 3.0.24 to 3.2.5 all at once, on our main fileserver. Suddenly all the Windows machines here at school could not access the shares anymore. This problem seems not to be documented anywhere, so I took a deep breath and started scrolling through the huge Samba changelog between the two versions. Luckily, the problematic change happened in version 3.0.25a: the default value of the msdfs root option changed from true to false, but Windows had cached this information. The solution is the usual one: just reboot Windows.
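If rebooting every Windows client is not practical, the old behaviour can presumably also be restored explicitly on the server side with something like the following in smb.conf (the share name and path here are placeholders, and this is a sketch based on the smb.conf manual page, not something we tested at the time):

```ini
; restore the pre-3.0.25a default explicitly for a share
[projects]
    path = /srv/samba/projects
    msdfs root = yes
```

With the option set explicitly, the MSDFS referral information cached by the clients should match the server again.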

Case Study: Real Time video encoding on Via Epia, Part II

Once upon a time, there was the glorious empire of DOS. It was a mighty and fearful time, when people could still talk to the heart of the machines and write code in the forgotten language of assembly. We are now lucky enough to have powerful compilers that do most of this low-level work for us, and hand-crafting assembly code is not needed anymore. But the introduction of SIMD (Single Instruction Multiple Data) extensions to the x86 instruction set made this ancient ability useful again.

MMX/SSE is a very powerful and dangerous beast, and we had almost no previous experience with low-level assembly programming. But we had a critical problem to solve: converting from the RGB colorspace to YUV, and doing it fast on our very limited board.
As I wrote in the previous article, the conversion is conceptually simple: it is basically a 3x3 matrix multiplication. That's it, nine scalar multiplications and a few sums per pixel and you're done!

SIMD instructions operate on packed data: more than one value (usually two or four) is stored in a single register, and operations on them run in parallel. For example, you can do four sums with a single instruction.
Unfortunately MMX/SSE is a vertical instruction set, which means you can do very little between the data elements packed in a single register. There are, however, instructions that compute 'half a scalar product', and we worked out an approach that uses them to maximize throughput.

Our camera, a Pointgrey Bumblebee, delivers raw sensor data via FireWire, arranged in a pattern called Bayer encoding. Color data is arranged in 2x2 cells, with twice as many sensors for green as for the other colors, since the human eye is more sensitive to that color. We first rearrange the input data into a strange but useful pattern, as shown in the picture. The following assembler code then does the magic, two pixels at a time.

//Load mm0 with 0; this will be useful to interleave the data bytes
pxor %mm0,%mm0
 
//Loading 8 bytes from buffer. Assume %eax contains the address of the input buffer
//One byte out of four is zero, but the overhead is well balanced by the aligned memory access.
//Those zeros will also be useful later on
movd (%eax),%mm1 // < R1, G1, B2, 0>
movd 4(%eax),%mm2 // < B1, 0, R2, G2>
//Unpack bytes to words: MMX registers are 8 bytes wide, so we can interleave the data bytes with zeros.
punpcklbw %mm0,%mm1
punpcklbw %mm0,%mm2
 
//We need three copies of each input, one for each output channel
movq %mm1,%mm3 // < R1, G1, B2, 0>
movq %mm2,%mm4 // < B1, 0, R2, G2>
movq %mm1,%mm5 // < R1, G1, B2, 0>
movq %mm2,%mm6 // < B1, 0, R2, G2>
 
//Multiply and accumulate; each of these does only half the work.
//We multiply the data by the right constants and sum the results in pairs.
//The consts are four packed 16-bit values and contain the coefficients scaled by 32768.
//[YUV]const and [YUV]const_inv hold the same coefficients, just arranged to suit the layout of the even/odd inputs
pmaddwd Yconst,%mm1 // < Y1*R1 + Y2*G1, Y3*B2 + 0>
pmaddwd Uconst,%mm3 // < U1*R1 + U2*G1, U3*B2 + 0>
pmaddwd Vconst,%mm5 // < V1*R1 + V2*G1, V3*B2 + 0>
 
pmaddwd Yconst_inv,%mm2 // < Y3*B1 + 0, Y1*R2 + Y2*G2>
pmaddwd Uconst_inv,%mm4 // < U3*B1 + 0, U1*R2 + U2*G2>
pmaddwd Vconst_inv,%mm6 // < V3*B1 + 0, V1*R2 + V2*G2>
 
//Add registers in pairs to get the final scalar products. The results are two packed pixels per output channel, still scaled by 32768
paddd %mm2,%mm1 // < Y1*R1 + Y2*G1 + Y3*B1, Y1*R2 + Y2*G2 + Y3*B2>
paddd %mm4,%mm3 // < U1*R1 + U2*G1 + U3*B1, U1*R2 + U2*G2 + U3*B2>
paddd %mm6,%mm5 // < V1*R1 + V2*G1 + V3*B1, V1*R2 + V2*G2 + V3*B2>
 
//We shift right by 15 bits to get rid of the scaling
psrad $15,%mm1
psrad $15,%mm3
psrad $15,%mm5
 
//const128 is two packed 32-bit values: the 128 offset to be added to the U/V channels
//const128:
// .long 128
// .long 128
paddd const128,%mm3
paddd const128,%mm5
 
//We repack the resulting dwords down to bytes: first dwords to words (signed saturation), then words to bytes (unsigned saturation)
packssdw %mm0,%mm1
packssdw %mm0,%mm3
packssdw %mm0,%mm5
 
packuswb %mm0,%mm1
packuswb %mm0,%mm3
packuswb %mm0,%mm5
 
//We copy the results to the destination buffers: two Y bytes, and one U and one V byte each. Assume %ebx, %esi and %edi contain the addresses of those buffers
movd %mm1,%ecx
movw %cx,(%ebx)
movd %mm3,%ecx
movb %cl,(%esi)
movd %mm5,%ecx
movb %cl,(%edi)

Simple, right? :-)

Coding this was difficult but, in the end, really interesting. Even more important, it was really fast, and we had no problems using it during the robot competition itself.


Case Study: Real Time video encoding on Via Epia, Part I

During the pESApod project we worked on the telecommunication and telemetry system for the robot. The computing infrastructure was very complex (maybe too complex): we had three Altera FPGAs on board plus a very low power consumption PC, a VIA Epia board. Using devices that are light on power is a must for mobile robots, but we ended up using more power for the electronics than for the motors. I guess the Altera boards, being prototyping devices, are very heavy on power.

Anyway, the Epia with its onboard Eden processor is a very nice machine. It is fully x86 compatible, and we managed to run Linux on it without problems. It does have a very low power footprint, but the performance tradeoff for this was quite heavy. The original plan was to have four video streams from the robot: a pair of proximity cameras for sample gathering and a stereocam for navigation and environment mapping. In the end we used only the stereocam, but even encoding just those two video streams on the Epia was really difficult.

We used libFAME for the encoding. The name stands for Fast Assembly MPEG Encoder. It is fast indeed, but it is also very poorly maintained, so we had some problems at first making it work. The library accepts frames in YUV format, but our camera sensor data came in Bayer encoding, so we had to write the format conversion routine ourselves.

RGB to YUV using matrix notation

The conversion from RGB color space to YUV is quite simple and can be done using linear algebra. Our first approach was really naive and based on floating point.

// RGB* rgb;
// YUV* yuv;
yuv[i].y=0.299*rgb[i].r + 0.587*rgb[i].g + 0.114*rgb[i].b;
yuv[i].u=128 - 0.168736*rgb[i].r - 0.331264*rgb[i].g + 0.5*rgb[i].b;
yuv[i].v=128 + 0.5*rgb[i].r - 0.418688*rgb[i].g - 0.081312*rgb[i].b;

This was really slow. We later discovered, to our disappointment, that the FPU was clocked at half the speed of the processor. So we changed the implementation to integer math. The result was something like this:

yuv[i].y=(299*rgb[i].r + 587*rgb[i].g + 114*rgb[i].b)/1000;
yuv[i].u=128 + (-169*rgb[i].r - 331*rgb[i].g + 500*rgb[i].b)/1000;
yuv[i].v=128 + (500*rgb[i].r - 419*rgb[i].g - 81*rgb[i].b)/1000;

This solution almost doubled the framerate. But it was still not enough, and we had to dive deep into the magic world of MMX/SSE instructions. The details are left for the next post.


Let’s join the information stream

Hello world,

Let’s introduce ourselves. We are a bunch of students at Scuola Superiore Sant’Anna in Pisa, Italy. We often start (less often finish) a lot of projects here. This is primarily a place for us to write down our ideas. Maybe someone out there could find them useful as well.

Our main interests revolve around computing, programming and security. But other topics may be touched on as well, who knows...

See you soon
