Archive for category Coding tricks

A tale of hidden symbols, weakness and gold

A few days ago I was profiling the startup time of the clang compiler. The callgrind report highlighted that a relatively large amount of time was being spent in _dl_lookup_symbol_x. A quick gdb inspection of the call stack pointed out that the dynamic linker was spending the time doing dynamic relocations. Tons of them.

Using objdump -R clang I found out that a whopping 42% of them actually contained the word “clang” in the mangled symbol names. Demangling the names made it clear that they were mostly definitions of the C++ virtual tables for clang internal classes.

Basically 42% of the relocations happening at run time were actually looking for symbols which are obviously defined inside the clang executable itself. So what’s happening here?

Turns out the problem is a poor interaction between the traditional linker (ld, from the binutils package) and C++’s ODR rule.

ODR stands for One Definition Rule and says that any entity in C++ code must be defined only once. This is obviously trivial for methods: if they are defined twice, even in different translation units, there will be an error at link time. There are, though, a few things that are implicitly defined by the compiler, for example the virtual tables, which are defined whenever a class using virtual methods is declared. (Here I’m speaking informally; I suspect the terminology is not 100% correct.) Since the vtable is potentially defined in more than one translation unit, the symbols for the vtable are flagged as weak. Weak symbols do not conflict with each other, and all but one (any one is fine) will be discarded. The surviving one will be used by the end product of the linker.
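To make this concrete, here is a tiny sketch (the class name is made up, not from clang): since all the virtual methods below are inline, the compiler has to emit the vtable in every translation unit that uses the class, flagged as a weak symbol.

// vtable_demo.cpp: a minimal sketch of an implicitly defined vtable
struct Shape {
    virtual ~Shape() {}
    virtual double area() const { return 0.0; }
};

// Instantiating the class forces the vtable to be emitted in this
// translation unit.
Shape shape;

Inspecting the resulting object file with nm -C should show the “vtable for Shape” symbol with a weak binding.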

Unfortunately, the compiler does not know whether the compiled object file will be used as part of a dynamic library or as part of a main executable. This means that it has to treat the vtable symbols like any library method, since (if the target is a dynamic library) such symbols may potentially be overridden by another definition in a library or in the main executable. This has to happen since the ODR rule must also apply across library boundaries for proper support of a few things, especially exception handling.

So at compile time there is no way around using dynamic relocations for the vtable symbols. A possible workaround would be to compile all the code with the -fvisibility=hidden flag. Unfortunately this is actually wrong, since it may break the ODR rule!

At link time the linker has the chance to eliminate the dynamic relocations, since it does know whether the target is a main executable, and symbols defined in an executable cannot be overridden by anything (not even by LD_PRELOADed libraries). Unfortunately the traditional ld linker does not apply this optimization.

The new gold linker, originally developed at Google, does, though! It is able to completely eradicate the dynamic relocations to internally defined symbols, effectively reducing the load time.

Moral of the tale: use the gold linker. It should work in most cases (I think kernel code is a notable exception) and generate faster executables while consuming less memory and CPU time during linking.

And please, dear debian/ubuntu maintainers, link clang using gold.


The quest for graphics performance: part II

I’d like to talk a bit about the architecture I’ve been using to efficiently render the video stream in Lightspark. As often happens in high performance computing, the key is using the right tool for each job. First of all, video decoding and rendering are asynchronous and executed by different threads.

Decoding itself is done by the widely known FFmpeg; no special tricks are played here. So the starting condition of the optimized fast path is a decoded frame data structure. This data structure is short lived and gets overwritten by the next decoded frame, so it must be copied to a more stable buffer. The decoding thread maintains a short array of decoded frames ready to be rendered, to account for variance in the decoding delay. The decoded frame is in YUV420 format; this means that the resolution of the color data is one half of the resolution of the luminance channel in each dimension. The data is returned by FFmpeg as 3 distinct buffers, one for each of the YUV channels, so we actually save 3 buffers per frame. This copy is necessary, and it’s the only one that will be done on the data.
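Something like the following sketch captures the shape of that per-frame buffer (the names are mine, not actual Lightspark code); note how each chroma plane holds a quarter of the bytes of the luminance plane:

#include <cstdint>
#include <vector>

// A decoded YUV420 frame: Y at full resolution, U and V at half
// resolution in both dimensions.
struct DecodedFrame
{
    uint32_t width, height;
    std::vector<uint8_t> y; // width * height bytes
    std::vector<uint8_t> u; // (width/2) * (height/2) bytes
    std::vector<uint8_t> v; // (width/2) * (height/2) bytes
};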

Rendering is done using a textured quad, and texture data is loaded using OpenGL Pixel Buffer Objects (PBOs). PBOs are memory buffers managed by the GL, and it’s possible to load texture data from them. Unfortunately they must be explicitly mapped into the client address space to be accessed, and unmapped when the update is complete. The advantage is that data transfers between PBOs and video or texture memory are done by the GL using asynchronous DMA transfers. Using 2 PBOs it’s possible to guarantee a continuous stream of data to video memory: while one PBO is being copied to texture memory by DMA, new data is being computed and transferred to the other using the CPU. This usage pattern is called streaming textures.
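A minimal sketch of the pattern follows (function and buffer names are mine, error checking is omitted, and GLEW is just one possible way to get at the extension entry points). It assumes a GL context, a texture tex and two already-created buffers pbo[0] and pbo[1]:

#include <GL/glew.h>
#include <cstdint>
#include <cstring>

void uploadFrame(GLuint tex, GLuint pbo[2], int w, int h,
                 const uint8_t* frame, size_t frameSize)
{
    static int cur = 0;
    // 1) Start the DMA transfer out of the PBO filled on the previous call.
    glBindTexture(GL_TEXTURE_2D, tex);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[cur]);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    GL_RGBA, GL_UNSIGNED_BYTE, 0); // 0 = offset into the PBO
    // 2) Meanwhile, fill the other PBO using the CPU.
    cur = 1 - cur;
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[cur]);
    // Orphan the old storage so the map does not stall on a pending transfer.
    glBufferData(GL_PIXEL_UNPACK_BUFFER, frameSize, 0, GL_STREAM_DRAW);
    void* dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    memcpy(dst, frame, frameSize); // in Lightspark this is the packing routine
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}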

In this case such data is the next frame, taken from the decoded frames buffer. Texture data for OpenGL must be provided as a single packed buffer, so we must pack the 1-buffer-per-channel frame into a single buffer. This can be done in a zero-copy fashion (no intermediate buffers) using instructions provided by the SSE2 extension. Data is loaded in 128 bit chunks from each of the Y, U and V channels, then it is correctly packed and padded using register-only operations. Results are written back using non-temporal moves. This means that the processor may feel free to postpone the actual commitment of data to memory, for example to exploit burst transfers on the bus. If we ever want to be sure that the changes have been committed to memory we have to issue the sfence instruction. For more information see the Intel reference manuals on movapd, movntpd, sfence, punpcklbw.
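For illustration, here is the same idea expressed with SSE2 intrinsics instead of raw assembly (a sketch with made-up names, not the actual Lightspark routine). It packs 16 pixels of one row from planar YUV420 into YUV0 texels and writes them out with non-temporal stores:

#include <emmintrin.h> // SSE2
#include <cstdint>

// y points to 16 luminance bytes, u and v to the 8 matching chroma bytes
// (half horizontal resolution); dst is 16-byte aligned, e.g. a mapped PBO.
void packRowYUV0(const uint8_t* y, const uint8_t* u, const uint8_t* v,
                 uint8_t* dst)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i Y  = _mm_load_si128((const __m128i*)y);  // Y0..Y15
    __m128i U8 = _mm_loadl_epi64((const __m128i*)u); // U0..U7
    __m128i V8 = _mm_loadl_epi64((const __m128i*)v); // V0..V7
    // Duplicate each chroma sample to restore full horizontal resolution.
    __m128i U = _mm_unpacklo_epi8(U8, U8); // U0,U0,U1,U1,...
    __m128i V = _mm_unpacklo_epi8(V8, V8);
    // Interleave into Y,U and V,0 byte pairs...
    __m128i YUlo = _mm_unpacklo_epi8(Y, U);
    __m128i YUhi = _mm_unpackhi_epi8(Y, U);
    __m128i V0lo = _mm_unpacklo_epi8(V, zero);
    __m128i V0hi = _mm_unpackhi_epi8(V, zero);
    // ...then interleave the 16 bit pairs into Y,U,V,0 texels and stream
    // them to memory with non-temporal moves.
    _mm_stream_si128((__m128i*)(dst +  0), _mm_unpacklo_epi16(YUlo, V0lo));
    _mm_stream_si128((__m128i*)(dst + 16), _mm_unpackhi_epi16(YUlo, V0lo));
    _mm_stream_si128((__m128i*)(dst + 32), _mm_unpacklo_epi16(YUhi, V0hi));
    _mm_stream_si128((__m128i*)(dst + 48), _mm_unpackhi_epi16(YUhi, V0hi));
}

// After the whole frame has been converted, _mm_sfence() guarantees the
// streamed writes are globally visible.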

The result is a single buffer in YUV0 format; the padding byte has been added to increase texture transfer efficiency, as video cards internally work with 32-bit data anyway. The destination buffer is one of the PBOs, so at the end of the conversion routine the data will be transferred to video memory using DMA.

Using the streaming texture technique and SSE2 data packing we managed to efficiently move the frame data to texture memory, but it’s still in YUV format. Conversion to the RGB color space is basically a linear algebra operation, so it’s ideal to offload this computation to a pixel shader.
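As a sketch, the fragment shader boils down to a handful of multiply-adds per pixel (the coefficients are the standard BT.601 full-range ones; the uniform name is made up and this is not the actual Lightspark shader):

// Legacy-style GLSL of the era, embedded as a C++ string.
const char* yuv2rgbSource = R"GLSL(
uniform sampler2D yuvTexture;
void main()
{
    vec4 t = texture2D(yuvTexture, gl_TexCoord[0].st); // packed YUV0 texel
    float y = t.r;
    float u = t.g - 0.5;
    float v = t.b - 0.5;
    gl_FragColor = vec4(y + 1.402 * v,
                        y - 0.344136 * u - 0.714136 * v,
                        y + 1.772 * u,
                        1.0);
}
)GLSL";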


Allocating aligned memory

Just a quick note that may be useful to someone else. As you may know, SSE2 introduced a new instruction: MOVDQA (MOVe Double Quadword Aligned). It is used to move 128 bits (16 bytes) of data from/to memory/xmm registers. This instruction only works if the data is aligned to the 16 byte boundary. There is also another instruction for the unaligned case (MOVDQU), but the aligned version is way faster. So let’s summarize some techniques to get an aligned memory area.

  • For local, static and member variables you can append __attribute__ ((aligned (16))) to the type definition. Example:
  • struct A { int val; } __attribute__ ((aligned (16)));

  • For dynamically allocated memory the usual malloc is not enough, but there is posix_memalign, which has the semantics that we need. It is defined as:
  • int posix_memalign(void **memptr, size_t alignment, size_t size);

So we have to pass a pointer to the pointer that will receive the newly allocated memory, the required alignment (which has to be a power of two) and the allocation size. Memory allocated this way can (at least on the glibc implementation) be freed using the usual free function.
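A minimal usage sketch (the size and the message are just examples):

#include <cstdlib>
#include <cstdio>

int main()
{
    void* buf = 0;
    // 16 byte alignment, as required by MOVDQA; 1024 bytes of storage.
    int err = posix_memalign(&buf, 16, 1024);
    if (err != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", err);
        return 1;
    }
    printf("aligned buffer at %p\n", buf);
    free(buf); // fine, at least with glibc
    return 0;
}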



The quest for graphics performance: part I


Developing and optimizing Lightspark, the modern Flash player, I’m greatly expanding my knowledge and understanding of GPU internals. In the last few days I’ve managed to find out a couple of nice tricks that boosted performance and, as nice side effects, saved CPU time and added features :-)

First of all, I’d like to introduce a bit of the Lightspark graphics architecture.

The project is designed from the ground up to make use of the features offered by modern graphics hardware, namely 3D acceleration and programmable shaders. The Flash file format encodes the geometries to be drawn as sets of edges. This representation is quite different from the one understood by GPUs, so the geometries are first triangulated (reduced to a set of triangles). This operation is done on the CPU and is quite computationally intensive, but the results are cached, so overall this does not hurt performance.

Moreover, Flash offers several different fill styles that can be applied to geometry, for example solid color and various kinds of gradients. Lightspark handles all those possibilities using a single fragment shader, a little piece of code that is invoked on every pixel to compute the desired color. Of course the shader has to know about the current fill style. This information, along with several other parameters, could be passed with different methods. More on this in the next issue.

There is one peculiar thing about the shader though; let’s look at some simple pseudo code:
gl_FragColor=solid_color()*selector[0]+linear_gradient()*selector[1]+circular_gradient()*selector[2]...;

selector is a binary array: the only allowed values are zero and one, and exactly one element is one. This means that the current fragment (pixel) color is computed for every possible fill style, and only afterward is the correct result selected. This may look like a waste of computing power, but it is actually more efficient than something like this:

if(selector[0])
       gl_FragColor=solid_color();
else if(selector[1])
       gl_FragColor=linear_gradient();
...

This counterintuitive fact comes from the nature of graphics hardware. GPUs are very different from CPUs: they are capable of crunching tons of vectorial operations blindingly fast, but they totally fall down on their knees when encountering branches in the code. This is actually quite common on deeply pipelined architectures which lack complex branch prediction circuitry: not only GPUs, but also number crunching devices and multimedia monsters such as the IBM Cell. Keep this in mind when working on such platforms.



Three GDB tricks for the masses

The GNU Debugger is a very powerful and fearsome beast. It seems that I’ve found a couple of useful tricks which are not as popular as they should be.

  • The watch command: any script kiddie knows about breakpoints. Watchpoints are based on the same concept, applied not to code but to data memory. You can set a watchpoint on a variable using the following syntax: watch var. It is also possible to watch a specific memory address with the syntax: watch *0xdeadbeef. A watchpoint triggers when the contents of the memory location change, so watchpoints are useful when trying to find out when a variable is modified. (Remember that in C++ you should always use private and protected sections to stop variables from being accessed/modified outside the expected code flow.) It is also possible to set read watchpoints using the rwatch command, which triggers when the location is accessed. This feature is mainly useful when reverse engineering compiled code, to find out which code paths make use of/are influenced by a certain variable/memory location. The major drawback of watchpoints is that common hardware supports only a few of them. When gdb runs out of hardware watchpoints it resorts to software emulation, which is very slow, and not possible at all for the read ones. This also means that putting watchpoints on big structures and classes is highly discouraged.
  • set print object: if you’ve ever made extensive use of polymorphic classes or otherwise complex class hierarchies, you will find this setting very useful. The print command, when printing pointers to base classes, will look at the virtual table to find out the actual object type, and print it beside the address. Moreover, when dereferencing such pointers it will print the full object contents, not only the base class members.
  • The info threads command: when you face a deadlock in your multithreaded code, the first step to understanding the problem is to find out which threads are blocked. The deadlock is often not easily reproducible, so you should first attach gdb to the process using the attach <pid> command; then, typing info threads, you will get for each thread the name of the function where it is currently stopped. Threads involved in a deadlock are usually stopped in sem_wait or similar semaphore related functions.

Well... I guess I’m using gdb way too much, maybe I’m not a good programmer :-)



Fun with ptrace: system emulation

The more I explore the interfaces deeply hidden in the Linux kernel, the more a new world of opportunity opens. Today, while taking a look at the ptrace API, I found out about the PTRACE_SYSEMU option.

But what is ptrace? It’s a kernel interface to check and manipulate the information that crosses the user space-kernel space frontier. Its main... principal... only user is usually gdb. The PTRACE_SYSEMU option is quite peculiar: it was implemented mainly for the User Mode Linux project. It allows you not only to monitor the system calls invoked by a process, but also to replace the system call semantics.

So... how could this be useful? For example to experiment with different auditing/sandboxing strategies, or to build compatibility layers at the system call level... but who knows what kind of funny things could be done!
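As a toy sketch (my own code, not from any project; x86-64 Linux only, error handling omitted), here is a tracer that intercepts the system calls of a child, prints their numbers and makes them return 0 without ever executing them:

#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <signal.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, 0, 0);
        raise(SIGSTOP);   // wait for the parent to start tracing
        getpid();         // a syscall for the parent to intercept
        _exit(0);
    }
    int status;
    waitpid(child, &status, 0);  // child is stopped by the SIGSTOP
    for (;;) {
        // Resume the child until the next syscall entry; the kernel will
        // skip the actual syscall execution.
        ptrace(PTRACE_SYSEMU, child, 0, 0);
        waitpid(child, &status, 0);
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, 0, &regs);
        printf("intercepted syscall %lld\n", (long long)regs.orig_rax);
        if (regs.orig_rax == SYS_exit_group) {
            kill(child, SIGKILL); // it cannot exit by itself anymore
            break;
        }
        regs.rax = 0; // the return value the tracee will observe
        ptrace(PTRACE_SETREGS, child, 0, &regs);
    }
    return 0;
}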



A Nice Little Book of Semaphores

We’re currently busy working on assignments for the internal exams here at Sant’Anna School1 and we (well, Mirko) found a nice book for our parallel programming project. It’s “The Little Book of Semaphores”, freely available here.

It’s a very nice collection of concurrency problems; it deals with pretty much every one you could think of. So thank you, professor Downey!

  1. Well, we’re also working on the usual stuff (ways to make big money, strange web 2.0 applications, high-performance emulators, big robots, small robots, our many-times-acclaimed editor for Wikipedia, world domination, ...).



Case Study: Real Time video encoding on Via Epia, Part II

Once upon a time, there was the glorious empire of DOS. It was a mighty and fearful time, when people could still talk to the heart of the machines and write code in the forgotten language of assembly. We are now lucky enough to have powerful compilers that do most of this low level work for us, and hand crafting assembly code is not needed anymore. But the introduction of SIMD (Single Instruction Multiple Data) within the x86 instruction set made this ancient ability useful again.

MMX/SSE are a very powerful and dangerous beast. We had almost no previous experience with low level assembly programming, and a critical problem to solve: how to convert from RGB colorspace to YUV, and do it fast on our very limited board.
As I wrote in the previous article, the conversion is conceptually simple: it’s basically a 3x3 matrix multiplication. That’s it, do 9 scalar products and you’re done!


SIMD instructions operate on packed data: this means that more than one value (usually 2 or 4) is stored in a single register, and operations on them are parallelized. For example you can do four sums with a single operation.
Unfortunately MMX/SSE is a vertical instruction set. This means you can do very little with the data packed in a single register. There are, however, instructions that do ‘half a scalar product’ (pmaddwd). We worked out an approach to maximize the throughput using this.

Our camera, a Pointgrey Bumblebee, delivers raw sensor data via FireWire, arranged in a pattern called Bayer encoding. Color data is arranged in 2x2 cells, and there are twice as many sensors for green as for the other colors, since the human eye is more sensitive to that color. We first rearrange the input data in a strange but useful pattern, shown in the register comments of the code below. The following assembler code then does the magic, two pixels at a time.

//Loading mm0 with 0, this will be useful to interleave data bytes
pxor %mm0,%mm0

//Loading 8 bytes from the buffer. Assume %eax contains the address of the input buffer.
//One out of four bytes is zero, but the overhead is well balanced by the aligned memory access.
//Those zeros will also be useful later on
movd (%eax),%mm1 // < R1, G1, B2, 0>
movd 4(%eax),%mm2 // < B1, 0, R2, G2>
//Unpacking bytes to words. MMX registers are 8 bytes wide, so we can interleave the data bytes with zeros.
punpcklbw %mm0,%mm1
punpcklbw %mm0,%mm2

//We need three copies of each input, one for each output channel
movq %mm1,%mm3 // < R1, G1, B2, 0>
movq %mm2,%mm4 // < B1, 0, R2, G2>
movq %mm1,%mm5 // < R1, G1, B2, 0>
movq %mm2,%mm6 // < B1, 0, R2, G2>

//Multiply and accumulate, this does only half the work.
//We multiply the data by the right constants and sum the results in pairs.
//The consts are four packed 16 bit values and contain the conversion constants scaled by 32768.
//[YUV]const and [YUV]const_inv are the same apart from being arranged to suit the layout of the even/odd inputs
pmaddwd Yconst,%mm1 // < Y1*R1 + Y2*G1, Y3*B2 + 0>
pmaddwd Uconst,%mm3 // < U1*R1 + U2*G1, U3*B2 + 0>
pmaddwd Vconst,%mm5 // < V1*R1 + V2*G1, V3*B2 + 0>

pmaddwd Yconst_inv,%mm2 // < Y3*B1 + 0, Y1*R2 + Y2*G2>
pmaddwd Uconst_inv,%mm4 // < U3*B1 + 0, U1*R2 + U2*G2>
pmaddwd Vconst_inv,%mm6 // < V3*B1 + 0, V1*R2 + V2*G2>

//Add registers in pairs to complete the scalar products. The results are two packed pixels for each output channel, still scaled by 32768
paddd %mm2,%mm1 // < Y1*R1 + Y2*G1 + Y3*B1, Y1*R2 + Y2*G2 + Y3*B2>
paddd %mm4,%mm3 // < U1*R1 + U2*G1 + U3*B1, U1*R2 + U2*G2 + U3*B2>
paddd %mm6,%mm5 // < V1*R1 + V2*G1 + V3*B1, V1*R2 + V2*G2 + V3*B2>

//We shift right by 15 bits to get rid of the scaling
psrad $15,%mm1
psrad $15,%mm3
psrad $15,%mm5

//const128 is two packed 32 bit values; this is the offset to be added to the U/V channels
//const128:
// .long 128
// .long 128
paddd const128,%mm3
paddd const128,%mm5

//We repack the resulting dwords to words, then to bytes
packssdw %mm0,%mm1
packssdw %mm0,%mm3
packssdw %mm0,%mm5

packuswb %mm0,%mm1
packuswb %mm0,%mm3
packuswb %mm0,%mm5

//We copy the results to the destination buffers; assume %ebx, %esi and %edi contain the addresses of such buffers
movd %mm1,%ecx
movw %cx,(%ebx)
movd %mm3,%ecx
movb %cl,(%esi)
movd %mm5,%ecx
movb %cl,(%edi)

Simple, right? :-)

Coding this was difficult but in the end really interesting. And even more important, it was really fast and we had no problems using it during the robot competition itself.



Case Study: Real Time video encoding on Via Epia, Part I

During the pESApod project we worked on the telecommunication and telemetry system for the robot. The computing infrastructure was very complex (maybe too much). We had three Altera FPGAs on board and a very low power consumption PC, a VIA Epia board. Using devices that are light on power needs is a must for mobile robots, but we ended up using more power for the electronics than for the motors. I guess the Altera boards are very heavy on power, being prototyping devices.

Anyway, the Epia with the onboard Eden processor is a very nice machine. It is fully x86 compatible, and we managed to run Linux on it without problems. It does have a very low power footprint, but the performance tradeoff for this was quite heavy. The original plan was to have four video streams from the robot: a pair of proximity cameras for sample gathering and a stereocam for navigation and environment mapping. In the end we used only the stereocam, but even encoding just those two video streams on the Epia was really difficult.

We used libFAME for the encoding. The name means Fast Assembly MPEG Encoder. It is fast indeed, but it is also very poorly maintained, so we had some problems at first making it work. The library accepts frames encoded in YUV format, but our camera sensor data was in Bayer encoding, so we had to write the format conversion routine.

[Figure: RGB to YUV conversion using matrix notation]

The conversion from RGB color space to YUV is quite simple and can be done using linear algebra. Our first approach was really naive and based on floating point:

// RGB* rgb;
// YUV* yuv;
yuv[i].y=0.299*rgb[i].r + 0.587*rgb[i].g + 0.114*rgb[i].b;
yuv[i].u=128 - 0.168736*rgb[i].r - 0.331264*rgb[i].g + 0.5*rgb[i].b;
yuv[i].v=128 + 0.5*rgb[i].r - 0.418688*rgb[i].g - 0.081312*rgb[i].b;

This was really slow. We later discovered to our disappointment that the FPU was clocked at half the speed of the processor, so we changed the implementation to integer math. The result was something like this:

yuv[i].y=(299*rgb[i].r + 587*rgb[i].g + 114*rgb[i].b)/1000;
yuv[i].u=128 - (169*rgb[i].r + 331*rgb[i].g - 500*rgb[i].b)/1000;
yuv[i].v=128 + (500*rgb[i].r - 419*rgb[i].g - 81*rgb[i].b)/1000;

This solution almost doubled the framerate, but it was still not enough, and we had to dive deep into the magic world of MMX/SSE instructions. The details in the next issue.

