Lightspark gets video streaming

Just a brief news. It’s been a long way, and today I’m very proud to announce video stream­ing sup­port for Lightspark, the effi­cient open source flash player. More­over, per­for­mance looks very promis­ing, I’m not going to pub­lish any results right now as I’d like to do some test­ing before. Any­way, Lightspark seems to be out­per­form­ing Adobe’s player by a good extent, at least on linux.

In the next post I’ll talk a bit about some per­for­mance tricks that made it pos­si­ble to reach such result.

, , , ,

1 Comment


Lightspark’s news

Lightspark pro­gresses are never been so good. The last achieve­ment was to cor­rectly load, exe­cute and par­tially ren­der the YouTube player. As you may have seen YouTube has recently switched to Flash 10 and Action­Script 3.0 to serve some HD con­tent, while keep­ing the old AS2 based player for lower qual­ity videos. The old player is sup­ported by Gnash but, until now, there where no open source alter­na­tives to play newer, high def­i­n­i­tion con­tent. As Lightspark AS3 engine matures, that gap is almost closed. Stay tuned, as I’m plan­ning to release a new tech­nol­ogy demo very soon.

UPDATE: Demo tar­ball released on source­forge

, ,

No Comments


Extreme FLEXibility

Although there has been no offi­cial news about Lightspark for sev­eral months, i’ve been doing a great deal of work under the hood. As my bach­e­lor the­sis, I’ve mostly com­pleted and throughly tested my LLVM based Action­script 3 .0 JIT engine and, dur­ing the last days, I’ve been work­ing on pol­ish­ing a bit the Vir­tual Machine. I’m proudly announc­ing that, in some days, a new tech­ni­cal demo of Lightspark will be released, but this time we’re not talk­ing about basic exam­ples. Lightspark is now mature enough to run a sim­ple appli­ca­tion based on Flex.

Flex is a rich open source frame­work writ­ten in Action­script and devel­oped mainly by Adobe. Even if right now the test appli­ca­tion fea­tures only a progress bar and a square, there is a lot of stuff being done by the frame­work under the hood.

If the frame­work works it means that the engine is now sta­ble enough to move from a pre-alpha to an alpha sta­tus. The design is also now sat­is­fy­ing enough for me to allow other peo­ple to join the project and work on on sub­sys­tems with­out know­ing the inter­nal details of every­thing. As an added bonus pre­lim­i­nary sup­port for the Win­dows plat­form will be included in the release.

The screen­shot above is the result of the exe­cu­tion of my test appli­ca­tion, for curi­ous peo­ple the flash file is gen­er­ated using the mxmlc com­piler, from the fol­low­ing source file

<?xml version="1.0" encoding="utf-8"?>
<mx:Application
xmlns:mx="http://www.adobe.com/2006/mxml"
horizontalAlign="center" verticalAlign="middle">
<mx:VBox x="0" y="0" width="201" height="200" backgroundColor="0x0080C0" alpha="0.8"/>
</mx:Application>

, , ,

No Comments


Allocating aligned memory

Just a quick note that may be use­ful to some­one else. As you may know SSE2 intro­duced a new instruc­tion: MOVDQA (MOVe Dou­ble Quad­word Aligned). This is used to move 128 bit (16 bytes) of data from/to memory/xmm reg­is­ters. This instruc­tion only works if the data is aligned the the 16 byte bound­ary. There is also another instruc­tion for the unaligned case, but the aligned ver­sion is way faster. So let’s sum­ma­rize some tech­niques to get an aligned mem­ory area

  • For local, sta­tic and mem­ber vari­ables you can append __attribute__ (( aligned (16 ) ) to the type def­i­n­i­tion. Example:
  • struct A { int val; } __attribute__ ((aligned ( 16 ) );

  • For dynam­i­cally allo­cated mem­ory the usual mal­loc is not enough, but there is a posix_memalign which has the seman­tics that we need. It is defined as:
  • int posix_memalign(void **memptr, size_t alignment, size_t size);

So we have to pass a pointer to the pointer that will receive the newly allo­cated mem­ory, the required align­ment (which has to be a power of two) and the allo­ca­tion size. Mem­ory allo­cated this way can (at least on the glibc imple­men­ta­tion) be freed using the usual free func­tion.

, , , , , ,

No Comments


Talking about CPU architectures: Then it came assembly chain

As we’ve seen in the pre­vi­ous essay in a sim­ple proces­sor design each instruc­tion needs sev­eral cycles to be com­pleted. Each instruc­tion passes sev­eral steps: first it’s loaded from mem­ory (FETCH0 state), then it’s decoded (FETCH1 state) and the appro­pri­ate code path is selected to com­pute the oper­a­tion results. In the my exam­ple I’ve not included other two essen­tial steps: operands fetch­ing and results stor­ing on mem­ory. This sequen­tial approach has a big draw­back, as for most of the time the CPU cir­cuits (such as the mem­ory inter­face or the ALU) are not doing use­ful work. If on aver­age each instruc­tion takes five steps to com­plete, then the proces­sor is only exe­cut­ing one instruc­tion every five cycles. To solve this prob­lem we can intro­duce the pipeline.

The pipeline con­cept is bor­rowed from the indus­trial assem­bly line. The basic idea is to decou­ple the sequen­tial steps of each instruc­tion by insert­ing buffer reg­is­ters in between. Each stage of the pipeline is iso­lated, and on each clock cycle it can take the input from the reg­is­ter upstream and out­put the results into the reg­is­ter down­stream. Of course not every instruc­tion needs pro­cess­ing by all stages (for exam­ple it may or may not access mem­ory) and so each stage has to have some kind of No Oper­a­tion mode to pass trough the unmod­i­fied input reg­is­ter to the output.

A sim­ple pseudo-verilog descrip­tion of a pipelined proces­sor fol­lows. We are assum­ing that the instruc­tion and data mem­ory are phys­i­cally sep­a­rate: an Havard like archi­tec­ture. This is not true in cur­rent main­stream archi­tec­tures, as both instruc­tion and data are stored in RAM, but may be con­sid­ered true to a large extend if there is a fast instruc­tion cache. More­over it has to be noted that although the access_memory mod­ule is ref­er­enced both in the input fetch­ing and out­put stor­ing stages, it can only cor­rectly serve one request at a time, and so some kind of arbi­tra­tion has to be used. We can arbi­trar­ily decide that reads are pri­or­i­tized over write, but in case of a col­li­sion the oper­a­tion will be for sure not cor­rect, though safe for the elec­tron­ics. The register_file mod­ule is just a mul­ti­plexer for the reg­is­ters, it will out­put the value of the reg­is­ters for the pro­vided indexes. The mod­ule it’s used both in the operands fetch­ing and write­back stages, but it’s safe because it allows two reg­is­ter reads and one write and the same time (three way access).

This toy proces­sor is inspired from MIPS archi­tec­ture and as you can see it uses fixed length instruc­tions. But I’ve also tried to design some­thing more sim­i­lar to an x86 proces­sor. The instruc­tion set (which I’m too lazy to define exten­sively) could allow both register-register and register-memory oper­a­tion, and out­put to either mem­ory or reg­is­ters. No indi­rect mem­ory access is allowed, sorry. As the instruc­tions are four byte long prob­a­bly only a 16 or 24 bit flat address space could be accessed.


always @(posedge clock)
begin
//
//Instructrion fetch stage
FETCHED_INSTRUCTION <= get_instruction_memory(PC); PC <= PC+4;
//
//Instruction decoding stage
{OPCODE_0, OPERAND1_TYPE, REG1_INDEX, REG2_INDEX, MEM_ADDRESS_0, WRITE_TO_MEMORY_0, DEST_REG_0} <= decode_instruction(FETCHED_INSTRUCTION);
//
//Operands fecthing stage, some register will be forwarded to the next stages
OPERAND1 <= (OPERAND1_TYPE==REG) ?
read_register_file(REG1_INDEX, REG2_INDEX)[31:0] :
access_memory(MEM_ADDRESS_0,READ,UNSPECIFIED);
OPERAND2 <= register_file(REG1_INDEX,REG2_INDEX)[63:32];
MEM_ADDRESS_1 <= MEM_ADDRESS_0;
WRITE_TO_MEMORY_1 <= WRITE_TO_MEMORY_0;
DEST_REG_1 <= DEST_REG_0;
OPCODE_1 <= OPCODE_0;
//
//Execution stage
RESULT <= alu(OPCODE_1,OPERAND1,OPERAND2);
MEM_ADDRESS_2 <= MEM_ADDRESS_1;
WRITE_TO_MEMORY_2 <= WRITE_TO_MEMORY_1;
DEST_REG_2 <= DEST_REG_1;
//
//Writeback stage
(WRITE_TO_MEMORY_2) ? access_memory(MEM_ADDRESS_2,WRITE,RESULT) : write_register_file(DEST_REG_2, RESULT);
end

Sev­eral con­sid­er­a­tions can be made on this design. First of all extreme care is needed from the com­piler (or the user, if it’s crazy enough to write assem­bly man­u­ally), because there are sev­eral haz­ards, not only as said pre­vi­ously on con­cur­rent read/write mem­ory access, but also on reg­is­ters. As the result of an instruc­tion is writ­ten back to the reg­is­ter file only at the end of the pipeline it’s not legal to access a result before five more instruc­tion has been exe­cuted. To guar­an­tee mem­ory access valid­ity it’s enough to avoid issu­ing a mem­ory write two instruc­tion after a mem­ory load. If no use­ful instruc­tion can be issued dur­ing this forced wait­ing, a NOP should be inserted.

Real world archi­tec­ture employs inter­lock­ing tech­niques which takes care of stop­ping (stalling) the pipeline when an haz­ard could occur. I’ll describe stalling in more detail in another essay. Any­way if a pipeline stage, such us the ALU for com­pli­cated com­pu­ta­tion (for exam­ple inte­ger divi­sion), needs more than a cycle to com­plete it will stall the provi­ous stages of the pipeline, and insert so called bub­bles into the later stage. Bub­bles are basi­cally safe No Oper­a­tion input states, the NOP opcode in any archi­tec­ture is just an explicit pipeline bub­ble. Even in mod­ern inter­locked archi­tec­ture cor­rect instruc­tion sched­ul­ing is extremely use­ful to gain per­for­mance as pipeline bub­bles actu­ally waste exe­cu­tion time.

That’s all for now, in the pre­vi­ous post I for­got to men­tion the ever fail­ing, but very nice Very Long Instruc­tion Word (VLIW) approach. I’m of course going to talk about that to sometime.

, , , , , , , , ,

No Comments


Talking about CPU architectures: The beginning

Some days ago I was talk­ing about mul­ti­core sys­tems and mod­ern CPU tech­nolo­gies with a younger engi­neer­ing stu­dent at Scuola Supe­ri­ore Sant’Anna. I was quite sur­prised by how much mis­con­cep­tions and hype sur­rounds this topic, even for peo­ple that actu­ally study com­puter engi­neer­ing. Espe­cially con­sid­er­ing that the courses of the Uni­ver­sity of Pisa are quite good when it comes to low level hard­ware stuff (com­puter engi­neer­ing was born by branch­ing elec­tronic engi­neer­ing, and the legacy is still quite evident).

So I’ll try to write some arti­cles about CPU archi­tec­tures, from the basic to advanced top­ics... well... as advanced as my cur­rent knowl­edge goes, still bet­ter than nothing.

Ok, for this essay I’ll basi­cally ref­er­ence the sim­ple proces­sor design that we study in our awe­some Logic Cir­cuits course (thanks Prof. Corsini, even if I’ve not seen a sin­gle class of it). The code pre­sented is writ­ten is some­thing hope­fully sim­i­lar to Ver­ilog, the hard­ware descrip­tion language.

This proces­sor is basi­cally a sim­ple state machine. The inter­nal state is stored in some syn­chro­nous reg­is­ter, usu­ally imple­mented using so called edge-triggered flip-flops. There is a clock source which gen­er­ates a square wave. The clock is used as an input to reg­is­ters, which are allowed to get a new value only on the rais­ing edge of the wave. The new value of the reg­is­ters is usu­ally com­puted by state­less asyn­chro­nous circuits.

PC adder

A sim­ple cir­cuit to get the address of the next instruc­tion: the new address is com­puted dur­ing the clock cycle by the asyn­chro­nous adder and assigned at the next ris­ing edge of the clock

This means that the out­put depends only on the input, and the clock sig­nal is not used. An exam­ple of this kind of cir­cuits could be the adder, or the more com­plex ALU. Those cir­cuits takes some time to com­plete their work, the adder, for exam­ple, has to prop­a­gate the carry infor­ma­tion from the least sig­nif­i­cant bit to the most sig­nif­i­cant bit. Before the com­pu­ta­tion is com­plete the out­put state under­goes sev­eral spu­ri­ous tran­si­tions (add gif) which will be ignored by the fol­low­ing reg­is­ter, as the new value will only be cap­tured on the rais­ing edge of the clock.

This allows us to under­stand two concepts:

  • The faster the clock is, the more new states can be com­puted, and so the more oper­a­tions can be done. So this is what the ven­dors are talk­ing about when they mar­ket they sev­eral GhZ CPUs.
  • Why can’t the clock be infi­nitely fast? Because to guar­an­tee that all the spu­ri­ous tran­si­tions from asyn­chro­nous cir­cuits will be dis­carted the clock cycle time has to be at least a lit­tle longer than the longest set­tle time of those circuits.

For a long time dur­ing the pre­vi­ous years it was easy to pump up the clock speeds, from the 4 Mhz of the first 8088 to the 4 Ghz of the last Pen­tiums, as the progress in CMOS techlo­gies allowed to make the cir­cuits smaller and smaller, which also means basi­cally faster and faster. But after a while this had to stop, even if tran­sis­tors are still get­ting smaller. The main prob­lem right now is that power con­sump­tion and heat pro­duc­tion go with the square of the clock fre­quency, which poses a very heavy upper limit to the rea­son­able clock speed, if you’re not will­ing to use liq­uid refrigeration.

This simple 3 bit adder needs some time to add 011 (3) and 001 (1) beacuase it has to propagate carry information

This sim­ple 3 bit adder needs some time to add 011 (3) and 001 (1) because it has to prop­a­gate carry information

So, back to the CPU inter­nals. One of the reg­is­ters has a spe­cial pur­pose: it keeps the state of the proces­sor and it’s usu­ally called STAR (STA­tus Reg­is­ter). So the ipotet­i­cal pseudo-verilog for this proces­sor would be:


reg [31:0] MBR, STAR, EAX, PC;
always @(posedge clock)
begin
case(STAR)
FETCH0: MBR <= get_memory(PC); STAR <= FETCH1; PC <= PC+4;
FETCH1: STAR <= decode_opcode(MBR);
...
INC0: EAX <= alu(OP_ADD,EAX,1) STAR <= FETCH0;
...
HLT: STAR<=HLT;
end

This would be a very sim­ple, incom­plete and totally use­less fixed instruc­tion length CPU, which can only incre­ment the accu­mu­la­tor and halt itself. Let’s explain this code a bit. MBR, STAR, EAX and PC are defined as reg­is­ters, in this case they are 32 bits wide. The STAR reg­is­ter could be larger or smaller depend­ing on the num­ber of pos­si­ble inter­nal state. The MBR is the Mas­ter Buffer Reg­is­ter and it’s used as a tem­po­rary stor­age for com­pu­ta­tions which should not be seen by the user, while EAX and PC are the accu­mu­la­tor and the pro­gram counter which we’re all used to. The \emph{always} syn­tax means that the actions has to be taken on the rais­ing edge of the clock. Reg­is­ters which are assigned in this block are usu­ally imple­mented using an array of edge trig­gered flip-flops. The <= sym­bols means non block­ing assign­ment. All those assign­ments are done in par­al­lel and they will only be effec­tive at the fol­low­ing cycle. The FETCH,INC and HLT are just sym­bolic ver­sion for some numer­i­cal con­stants, which rep­re­sent the state. The get_memory, decode_opcode and alu sym­bols are all asyn­chro­nous cir­cuits, they will take their inputs and gen­er­ate the out­put after some time, hope­fully before the next clock cycle. It must be noted that the get_memory in the real world can­not be mod­eled as an asyn­chro­nous cir­cuit, as it usu­ally takes sev­eral cycle to get some value out of memory.

You can see that in the FETCH0 case we use the PC both as an input and out­put, but this is per­fectly safe as all the com­pu­ta­tion will be based on the val­ues at the begin­ning of the cycle, and the new val­ues will be gath­ered at the end of it. How to imple­ment the case in hard­ware? Well it’s quite easy because we can exploit the for free (mod­ulo power con­sump­tion) par­al­lelism of elec­tron­ics to com­pute all the pos­si­ble out­come from each instruc­tion and then use mul­ti­plexer to get the desired one. Of course to avoid races in the inputs we also need mul­ti­plexer on the input side of com­mon cir­cuitry, such as the alu.

So... this is really the basic sim­ple way to build a com­pu­ta­tional machine. Much new inter­est­ing stuff has been added to this now, and I’ll spend some words on that in the next essays. The top­ics that I would like to cover are pipelined, super­scalar, vec­to­r­ial and mul­ti­core CPUs. Maybe I’ll also take a brief look at the new trend (or hype): het­eroge­nous CPUs, such as IBMs Cell. Com­ment and sug­ges­tion are, as always, really welcome.

, , , , ,

No Comments


Getting things Gnome! gets Remember The Milk support

rtm to gtgI’m proud to announce that the plu­gin that enables the syn­chro­niza­tion between Get­ting things gnome! and Remem­ber The Milk has landed into trunk. Hope­fully, it will be dis­trib­uted in the upcom­ing release of gtg.

The syn­chro­niza­tion cur­rently sup­port tasks title and text, tags and due dates, while respect­ing all the unsyn­ca­ble stuff (e.g.: rtm does not have the con­cept of subtask).

You can have a feel of what’s com­ing with these sim­ple commands:

sudo apt-get install bazaar
bzr branch lp:gtg
cd gtg
./gtg

A big “thank you” to the lovely gtg com­mu­nity (which has been a plea­sure to work with) is absolutely due,  espe­cially to Paulo, who designed the plu­gin system.

I’m cur­rently work­ing on other fea­tures to add to gtg, so stay tuned.

PS: Mono haters, haven’t you noticed? gtg now repli­cates some of  Tasque fea­tures. And we all know that Tasque exe­cutable ends with .exe

No Comments


West coast and binary code

g4515We’re back! Finally I’ve found some time to write, and some­thing to write about too! Me and Jacopo tem­porar­ily changed time­zone and ocean, as we are cur­rently work­ing at the Secu­rity Group of Cal­i­for­nia Uni­ver­sity, Santa Barbara.

Our cur­rent project is really inter­est­ing and could have far more appli­ca­tions than secu­rity. As it’s still in an early devel­op­ment phase, I will not yet delve in the details of our approach; let’s just say that we are work­ing on a very effi­cient way to han­dle data taint­ing and gen­eral binary code instru­men­ta­tion. As we are now get­ting to know very inti­mately the x86 machine code, we also found out that, even if a lot of opcodes are mapped in an almost sen­si­ble way, oth­ers are just scat­tered around in a seem­ingly ran­dom pat­tern. I guess that those were just added after the ini­tial design, and are prob­a­bly decoded by a WTF-is-this-instruction Unit. I can­not help think­ing ‘How sim­ple life would be if x86 were RISC’...’. Maybe we can still hope the Ita­nium stuff will come to the res­cue1, but prob­a­bly we are going to stick to this legacy from the 16-bit era for a long time. More on this next, I hope.

  1. But I, Jacopo, do not believe so. []

, , , , ,

No Comments


The quest for graphics performance: part I

lightspark-tech-demo2-revamp

Devel­op­ing and opti­miz­ing Lightspark, the mod­ern Flash player, I’m greatly expand­ing my knowl­edge and under­stand­ing of GPU inter­nals. In the last few days I’ve man­aged to find out a cou­ple of nice tricks that boosted up per­for­mance and, as nice side effects, saved CPU time and added fea­tures :-)

First of all, I’d like to intro­duce a bit of the Lightspark graph­ics architecture

The project is designed from the ground up to make use of the fea­tures offered by mod­ern graph­ics hard­ware. Namely 3D accel­er­a­tion and pro­gram­ma­ble shaders. The Flash file for­mat encodes the geome­tries to be drawn as set of edges. This rep­re­sen­ta­tion is quite dif­fer­ent from the one under­stood by GPUs. So the geome­tries are first tri­an­gu­lated (reduced to a set of tri­an­gles). This oper­a­tion is done on the CPU and is quite com­pu­ta­tion­ally inten­sive, but the results are cached, so over­all this does not hit performance.

More­over Flash offer sev­eral dif­fer­ent fill styles that should be applied on geom­e­try, for exam­ple solid color and var­i­ous kind of gra­di­ents. Lightspark han­dles all those pos­si­bil­i­ties using a sin­gle frag­ment shader, a lit­tle piece of code that is invoked on every pixel to com­pute the desired color. Of course the shader has to know about the cur­rent fill style. This infor­ma­tion along with sev­eral other para­me­ters could be passed with dif­fer­ent meth­ods. More on this on the next issue.

There is one pecu­liar thing about the shader though, let’s look at a sim­ple pseudo code:
gl_FragColor=solid_color()*selector[0]+linear_gradient()*selector[1]+circular_gradient()*selector[2]...;

Selec­tor is a binary array, the only allowed val­ues are zero or one. More­over only one value is one. This means that the cur­rent frag­ment (pixel) color is com­puted for every pos­si­ble fill style and only after­ward the cor­rect result is selected. This may look like  a waste of com­put­ing power, but it is actu­ally more effi­cient than some­thing like this:

if(selector[0])
       gl_FragColor=solid_color();
else if(selector[1])
       gl_FragColot=linear_gradient();
...

This counter intu­itive fact comes from the nature of the graph­ics hard­ware. GPUs are very dif­fer­ent from CPUs and are capa­ble of cruch­ing tons of vec­to­r­ial oper­a­tions blind­ingly fast. But they totally fall down on their knees when encoun­ter­ing branches in the code. This is actu­ally quite com­mon on deeply pipelined archi­tec­ture which misses com­plex branch pre­dic­tion cir­cuitry, not only GPUs but also num­ber crunch­ing devices and mul­ti­me­dia mon­sters such as IBM Cell. Keep this in mind when work­ing on such platforms.

, , , , , , ,

2 Comments


Lightspark second technical demo announcement

lightspark-techdemo2

I’m cur­rently fin­ish­ing some last cleanups and enhance­ments before releas­ing a sec­ond tech­ni­cal demo of the Lightspark Project. Much time is passed from the first demo, and the project is grow­ing healty. This release aims at ren­der­ing the fol­low­ing movie, selected from adobe demo. The results may not be very impres­sive. But many things are going on under the hood.

The most inter­est­ing fea­ture in this release are:

  • GLSL based ren­der­ing of fill styles (eg. gradients)
  • LLVM based Action­Script exe­cu­tion. Code is com­piled just in time
  • A few tricks are also played to decrease the stack traf­fic tipi­cal of stack machines.
  • First, although sim­ple, fram­er­ate timing
  • Frame­work to han­dle Action­Script asyn­chro­nous events. Cur­rently only the enter­Frame event works, as the input sub­sys­tem is not yet in place. But stay tuned, as I’ve some nice plan about that.

The code will be released in a cou­ple of more days, or at least I hope so :-)

, , , , ,

No Comments



SetPageWidth

Please see the following legal note (this site is hosted in Italy, the Italian version is binding):

Questo sito web utilizza Google Analytics, un servizio di analisi web fornito da Google, Inc. ("Google"). Google Analytics utilizza dei "cookies", che sono file di testo che vengono depositati sul Vostro computer per consentire al sito web di analizzare come gli utenti utilizzano il sito. Le informazioni generate dal cookie sull'utilizzo del sito web da parte Vostra (compreso il Vostro indirizzo IP) verranno trasmesse a, e depositate presso i server di Google negli Stati Uniti. Google utilizzerà queste informazioni allo scopo di tracciare e esaminare il Vostro utilizzo del sito web, compilare report sulle attività del sito web per gli operatori del sito web e fornire altri servizi relativi alle attività del sito web e all'utilizzo di Internet. Google può anche trasferire queste informazioni a terzi ove ciò sia imposto dalla legge o laddove tali terzi trattino le suddette informazioni per conto di Google. Google non assocerà il vostro indirizzo IP a nessun altro dato posseduto da Google. Potete rifiutarvi di usare i cookies selezionando l'impostazione appropriata sul vostro browser. Utilizzando il presente sito web, voi acconsentite al trattamento dei Vostri dati da parte di Google per le modalità e i fini sopraindicati