# Archive for category Bar discussions

### A C++ feature I would love to have: custom type modifiers

In C++ the const type mod­i­fiers when applied to meth­ods has the nice prop­erty of mak­ing it impos­si­ble by default (e.g. with­out a const_cast) to call non-const meth­ods. A sim­i­lar fea­ture is pro­vided by C++0x key­word con­s­t­expr.

It would be very nice if this prop­erty could be applied to cus­tom classes of func­tions. The idea would be to tag method def­i­n­i­tions with a cus­tom key­word, or a type. Tagged meth­ods would then only be able to call method tagged with the same type.

A nice usage I could find for this would be to define a non-blocking class of meth­ods. Non block­ing meth­ods are such if they never blocks for a poten­tially unbounded amount of time wait­ing for an event. Using the tags the com­piler will be able to detect if a method sup­posed to be non-blocking is try­ing to call some poten­tially block­ing code and this would be very use­ful to detect pos­si­ble dead­locks at com­pile time. Another usage would be to tag all the meth­ods that are sup­posed to be used by a spe­cific thread with the same tag. With such setup meth­ods that access shared data would be untagged and using them would require a cast or some spe­cial syn­tax, hope­fully help­ing the pro­gram­mer remem­ber that some syn­chro­niza­tion is needed.

A pos­si­ble approach to (ab)use the volatile key­word to detect race con­di­tions was described here in 2001. But maybe this approach could be more gen­eral. Any com­ment?

As a depar­ture from the usual tech­ni­cal arti­cles and release announce­ments I’d like to talk a bit about my uni­ver­sity: the Sant’Anna School of Advanced Stud­ies (Scuola Supe­ri­ore Sant’Anna in Ital­ian). This place is quite unique in Italy, the only sim­i­lar insti­tu­tion being the Scuola Nor­male Supe­ri­ore. I’ve never heard about sim­i­lar insti­tu­tions in other coun­tries. I’d love to hear if any­one knows about some­thing like this some­where else.

At a first glance the Sant’Anna of Advanced Stud­ies could be descibed as a very small, state owned (pub­lic) uni­ver­sity. By “very small” I mean that the grand total of the stu­dents attend­ing bach­e­lors and mas­ter classes is no more than 250. Each year around 45 stu­dents are selected in a very strict nation­wide selec­tion from around  1000 appli­cants in the fields of Engi­neer­ing, Eco­nom­ics, Med­i­cine, Agri­cul­ture, Polit­i­cal Sci­ence and Law. Selected appli­cants have access to the school ser­vices com­pletely free of charge, includ­ing lodg­ing and canteen.

The high level of selec­tiv­ity (the peak is 1 admit­ted each 20 appli­cants) makes it pos­si­ble to gather some of the most tal­ented peo­ple grad­u­ated in the nation’s high schools. As the school pro­vides all needed infra­struc­ture to the stu­dents peo­ple are able to leave their home and come to Pisa inde­pen­dently of the eco­nomic and social sta­tus of their fam­i­lies. There is no way to access the School other than the offi­cial selec­tion, not even pay­ing a fee.

As stu­dents we live in col­le­giate struc­ture.  The rooms are assigned fol­low­ing our own inter­nal reg­u­la­tion, this is an exam­ple of the large amount of man­age­ment that is del­e­gated to stu­dents. Sev­eral com­mis­sions are elected each year to man­age a vari­ety of things, from net­work infra­struc­ture to pub­lic rela­tions. Liv­ing in com­mon struc­tures and shar­ing the respon­si­bil­ity of their main­te­nance gen­er­ate a very pecu­liar and often long term bond­ing between stu­dents. Dur­ing the five years that most peo­ple spend in the school a net of rela­tions forms that con­nects peo­ple that are very dif­fer­ent for inter­ests  and expe­ri­ences. Liv­ing in a inter­dis­ci­pli­nary envi­ron­ment is an extra­or­di­nary expe­ri­ence. It’s very com­mon to see peo­ple dis­cussing about any­thing from phi­los­o­phy and pol­i­tics to trap­pist beers with the same com­mit­ment, most often while eat­ing together at the canteen.

From the aca­d­e­mic point of view the School is also pretty unique. Although it’s a full state rec­og­nized uni­ver­sity it does not offers full courses. We are at the same time Pupils of the Sant’Anna School and stu­dents of the Uni­ver­sity of Pisa. We attends reg­u­lar classes at the Uni­ver­sity of Pisa and classes on more advanced mod­ern research top­ics. Any­way most peo­ple feel that the most impor­tant part of what we learn comes from liv­ing with older stu­dents and research expe­ri­ences abroad. I’ve myself worked in Cal­i­for­nia and Switzer­land dur­ing the last cou­ple of summers.

Feel free to com­ment and share your opin­ions about such par­tic­u­lar kind of university.

### Talking about CPU architectures: Then it came assembly chain

As we’ve seen in the pre­vi­ous essay in a sim­ple proces­sor design each instruc­tion needs sev­eral cycles to be com­pleted. Each instruc­tion passes sev­eral steps: first it’s loaded from mem­ory (FETCH0 state), then it’s decoded (FETCH1 state) and the appro­pri­ate code path is selected to com­pute the oper­a­tion results. In the my exam­ple I’ve not included other two essen­tial steps: operands fetch­ing and results stor­ing on mem­ory. This sequen­tial approach has a big draw­back, as for most of the time the CPU cir­cuits (such as the mem­ory inter­face or the ALU) are not doing use­ful work. If on aver­age each instruc­tion takes five steps to com­plete, then the proces­sor is only exe­cut­ing one instruc­tion every five cycles. To solve this prob­lem we can intro­duce the pipeline.

The pipeline con­cept is bor­rowed from the indus­trial assem­bly line. The basic idea is to decou­ple the sequen­tial steps of each instruc­tion by insert­ing buffer reg­is­ters in between. Each stage of the pipeline is iso­lated, and on each clock cycle it can take the input from the reg­is­ter upstream and out­put the results into the reg­is­ter down­stream. Of course not every instruc­tion needs pro­cess­ing by all stages (for exam­ple it may or may not access mem­ory) and so each stage has to have some kind of No Oper­a­tion mode to pass trough the unmod­i­fied input reg­is­ter to the output.

A sim­ple pseudo-verilog descrip­tion of a pipelined proces­sor fol­lows. We are assum­ing that the instruc­tion and data mem­ory are phys­i­cally sep­a­rate: an Havard like archi­tec­ture. This is not true in cur­rent main­stream archi­tec­tures, as both instruc­tion and data are stored in RAM, but may be con­sid­ered true to a large extend if there is a fast instruc­tion cache. More­over it has to be noted that although the access_memory mod­ule is ref­er­enced both in the input fetch­ing and out­put stor­ing stages, it can only cor­rectly serve one request at a time, and so some kind of arbi­tra­tion has to be used. We can arbi­trar­ily decide that reads are pri­or­i­tized over write, but in case of a col­li­sion the oper­a­tion will be for sure not cor­rect, though safe for the elec­tron­ics. The register_file mod­ule is just a mul­ti­plexer for the reg­is­ters, it will out­put the value of the reg­is­ters for the pro­vided indexes. The mod­ule it’s used both in the operands fetch­ing and write­back stages, but it’s safe because it allows two reg­is­ter reads and one write and the same time (three way access).

This toy proces­sor is inspired from MIPS archi­tec­ture and as you can see it uses fixed length instruc­tions. But I’ve also tried to design some­thing more sim­i­lar to an x86 proces­sor. The instruc­tion set (which I’m too lazy to define exten­sively) could allow both register-register and register-memory oper­a­tion, and out­put to either mem­ory or reg­is­ters. No indi­rect mem­ory access is allowed, sorry. As the instruc­tions are four byte long prob­a­bly only a 16 or 24 bit flat address space could be accessed.

 always @(posedge clock) begin // //Instructrion fetch stage FETCHED_INSTRUCTION <= get_instruction_memory(PC); PC <= PC+4; // //Instruction decoding stage {OPCODE_0, OPERAND1_TYPE, REG1_INDEX, REG2_INDEX, MEM_ADDRESS_0, WRITE_TO_MEMORY_0, DEST_REG_0} <= decode_instruction(FETCHED_INSTRUCTION); // //Operands fecthing stage, some register will be forwarded to the next stages OPERAND1 <= (OPERAND1_TYPE==REG) ? read_register_file(REG1_INDEX, REG2_INDEX)[31:0] : access_memory(MEM_ADDRESS_0,READ,UNSPECIFIED); OPERAND2 <= register_file(REG1_INDEX,REG2_INDEX)[63:32]; MEM_ADDRESS_1 <= MEM_ADDRESS_0; WRITE_TO_MEMORY_1 <= WRITE_TO_MEMORY_0; DEST_REG_1 <= DEST_REG_0; OPCODE_1 <= OPCODE_0; // //Execution stage RESULT <= alu(OPCODE_1,OPERAND1,OPERAND2); MEM_ADDRESS_2 <= MEM_ADDRESS_1; WRITE_TO_MEMORY_2 <= WRITE_TO_MEMORY_1; DEST_REG_2 <= DEST_REG_1; // //Writeback stage (WRITE_TO_MEMORY_2) ? access_memory(MEM_ADDRESS_2,WRITE,RESULT) : write_register_file(DEST_REG_2, RESULT); end 

Sev­eral con­sid­er­a­tions can be made on this design. First of all extreme care is needed from the com­piler (or the user, if it’s crazy enough to write assem­bly man­u­ally), because there are sev­eral haz­ards, not only as said pre­vi­ously on con­cur­rent read/write mem­ory access, but also on reg­is­ters. As the result of an instruc­tion is writ­ten back to the reg­is­ter file only at the end of the pipeline it’s not legal to access a result before five more instruc­tion has been exe­cuted. To guar­an­tee mem­ory access valid­ity it’s enough to avoid issu­ing a mem­ory write two instruc­tion after a mem­ory load. If no use­ful instruc­tion can be issued dur­ing this forced wait­ing, a NOP should be inserted.

Real world archi­tec­ture employs inter­lock­ing tech­niques which takes care of stop­ping (stalling) the pipeline when an haz­ard could occur. I’ll describe stalling in more detail in another essay. Any­way if a pipeline stage, such us the ALU for com­pli­cated com­pu­ta­tion (for exam­ple inte­ger divi­sion), needs more than a cycle to com­plete it will stall the provi­ous stages of the pipeline, and insert so called bub­bles into the later stage. Bub­bles are basi­cally safe No Oper­a­tion input states, the NOP opcode in any archi­tec­ture is just an explicit pipeline bub­ble. Even in mod­ern inter­locked archi­tec­ture cor­rect instruc­tion sched­ul­ing is extremely use­ful to gain per­for­mance as pipeline bub­bles actu­ally waste exe­cu­tion time.

That’s all for now, in the pre­vi­ous post I for­got to men­tion the ever fail­ing, but very nice Very Long Instruc­tion Word (VLIW) approach. I’m of course going to talk about that to sometime.

### Talking about CPU architectures: The beginning

Some days ago I was talk­ing about mul­ti­core sys­tems and mod­ern CPU tech­nolo­gies with a younger engi­neer­ing stu­dent at Scuola Supe­ri­ore Sant’Anna. I was quite sur­prised by how much mis­con­cep­tions and hype sur­rounds this topic, even for peo­ple that actu­ally study com­puter engi­neer­ing. Espe­cially con­sid­er­ing that the courses of the Uni­ver­sity of Pisa are quite good when it comes to low level hard­ware stuff (com­puter engi­neer­ing was born by branch­ing elec­tronic engi­neer­ing, and the legacy is still quite evident).

So I’ll try to write some arti­cles about CPU archi­tec­tures, from the basic to advanced top­ics... well... as advanced as my cur­rent knowl­edge goes, still bet­ter than nothing.

Ok, for this essay I’ll basi­cally ref­er­ence the sim­ple proces­sor design that we study in our awe­some Logic Cir­cuits course (thanks Prof. Corsini, even if I’ve not seen a sin­gle class of it). The code pre­sented is writ­ten is some­thing hope­fully sim­i­lar to Ver­ilog, the hard­ware descrip­tion language.

This proces­sor is basi­cally a sim­ple state machine. The inter­nal state is stored in some syn­chro­nous reg­is­ter, usu­ally imple­mented using so called edge-triggered flip-flops. There is a clock source which gen­er­ates a square wave. The clock is used as an input to reg­is­ters, which are allowed to get a new value only on the rais­ing edge of the wave. The new value of the reg­is­ters is usu­ally com­puted by state­less asyn­chro­nous circuits.

A sim­ple cir­cuit to get the address of the next instruc­tion: the new address is com­puted dur­ing the clock cycle by the asyn­chro­nous adder and assigned at the next ris­ing edge of the clock

This means that the out­put depends only on the input, and the clock sig­nal is not used. An exam­ple of this kind of cir­cuits could be the adder, or the more com­plex ALU. Those cir­cuits takes some time to com­plete their work, the adder, for exam­ple, has to prop­a­gate the carry infor­ma­tion from the least sig­nif­i­cant bit to the most sig­nif­i­cant bit. Before the com­pu­ta­tion is com­plete the out­put state under­goes sev­eral spu­ri­ous tran­si­tions (add gif) which will be ignored by the fol­low­ing reg­is­ter, as the new value will only be cap­tured on the rais­ing edge of the clock.

This allows us to under­stand two concepts:

• The faster the clock is, the more new states can be com­puted, and so the more oper­a­tions can be done. So this is what the ven­dors are talk­ing about when they mar­ket they sev­eral GhZ CPUs.
• Why can’t the clock be infi­nitely fast? Because to guar­an­tee that all the spu­ri­ous tran­si­tions from asyn­chro­nous cir­cuits will be dis­carted the clock cycle time has to be at least a lit­tle longer than the longest set­tle time of those circuits.

For a long time dur­ing the pre­vi­ous years it was easy to pump up the clock speeds, from the 4 Mhz of the first 8088 to the 4 Ghz of the last Pen­tiums, as the progress in CMOS techlo­gies allowed to make the cir­cuits smaller and smaller, which also means basi­cally faster and faster. But after a while this had to stop, even if tran­sis­tors are still get­ting smaller. The main prob­lem right now is that power con­sump­tion and heat pro­duc­tion go with the square of the clock fre­quency, which poses a very heavy upper limit to the rea­son­able clock speed, if you’re not will­ing to use liq­uid refrigeration.

This sim­ple 3 bit adder needs some time to add 011 (3) and 001 (1) because it has to prop­a­gate carry information

So, back to the CPU inter­nals. One of the reg­is­ters has a spe­cial pur­pose: it keeps the state of the proces­sor and it’s usu­ally called STAR (STA­tus Reg­is­ter). So the ipotet­i­cal pseudo-verilog for this proces­sor would be:

 reg [31:0] MBR, STAR, EAX, PC; always @(posedge clock) begin case(STAR) FETCH0: MBR <= get_memory(PC); STAR <= FETCH1; PC <= PC+4; FETCH1: STAR <= decode_opcode(MBR); ... INC0: EAX <= alu(OP_ADD,EAX,1) STAR <= FETCH0; ... HLT: STAR<=HLT; end 

This would be a very sim­ple, incom­plete and totally use­less fixed instruc­tion length CPU, which can only incre­ment the accu­mu­la­tor and halt itself. Let’s explain this code a bit. MBR, STAR, EAX and PC are defined as reg­is­ters, in this case they are 32 bits wide. The STAR reg­is­ter could be larger or smaller depend­ing on the num­ber of pos­si­ble inter­nal state. The MBR is the Mas­ter Buffer Reg­is­ter and it’s used as a tem­po­rary stor­age for com­pu­ta­tions which should not be seen by the user, while EAX and PC are the accu­mu­la­tor and the pro­gram counter which we’re all used to. The \emph{always} syn­tax means that the actions has to be taken on the rais­ing edge of the clock. Reg­is­ters which are assigned in this block are usu­ally imple­mented using an array of edge trig­gered flip-flops. The <= sym­bols means non block­ing assign­ment. All those assign­ments are done in par­al­lel and they will only be effec­tive at the fol­low­ing cycle. The FETCH,INC and HLT are just sym­bolic ver­sion for some numer­i­cal con­stants, which rep­re­sent the state. The get_memory, decode_opcode and alu sym­bols are all asyn­chro­nous cir­cuits, they will take their inputs and gen­er­ate the out­put after some time, hope­fully before the next clock cycle. It must be noted that the get_memory in the real world can­not be mod­eled as an asyn­chro­nous cir­cuit, as it usu­ally takes sev­eral cycle to get some value out of memory.

You can see that in the FETCH0 case we use the PC both as an input and out­put, but this is per­fectly safe as all the com­pu­ta­tion will be based on the val­ues at the begin­ning of the cycle, and the new val­ues will be gath­ered at the end of it. How to imple­ment the case in hard­ware? Well it’s quite easy because we can exploit the for free (mod­ulo power con­sump­tion) par­al­lelism of elec­tron­ics to com­pute all the pos­si­ble out­come from each instruc­tion and then use mul­ti­plexer to get the desired one. Of course to avoid races in the inputs we also need mul­ti­plexer on the input side of com­mon cir­cuitry, such as the alu.

So... this is really the basic sim­ple way to build a com­pu­ta­tional machine. Much new inter­est­ing stuff has been added to this now, and I’ll spend some words on that in the next essays. The top­ics that I would like to cover are pipelined, super­scalar, vec­to­r­ial and mul­ti­core CPUs. Maybe I’ll also take a brief look at the new trend (or hype): het­eroge­nous CPUs, such as IBMs Cell. Com­ment and sug­ges­tion are, as always, really welcome.

### West coast and binary code

We’re back! Finally I’ve found some time to write, and some­thing to write about too! Me and Jacopo tem­porar­ily changed time­zone and ocean, as we are cur­rently work­ing at the Secu­rity Group of Cal­i­for­nia Uni­ver­sity, Santa Barbara.

Our cur­rent project is really inter­est­ing and could have far more appli­ca­tions than secu­rity. As it’s still in an early devel­op­ment phase, I will not yet delve in the details of our approach; let’s just say that we are work­ing on a very effi­cient way to han­dle data taint­ing and gen­eral binary code instru­men­ta­tion. As we are now get­ting to know very inti­mately the x86 machine code, we also found out that, even if a lot of opcodes are mapped in an almost sen­si­ble way, oth­ers are just scat­tered around in a seem­ingly ran­dom pat­tern. I guess that those were just added after the ini­tial design, and are prob­a­bly decoded by a WTF-is-this-instruction Unit. I can­not help think­ing ‘How sim­ple life would be if x86 were RISC’...’. Maybe we can still hope the Ita­nium stuff will come to the res­cue1, but prob­a­bly we are going to stick to this legacy from the 16-bit era for a long time. More on this next, I hope.

1. But I, Jacopo, do not believe so. []

### Google Chrome bug #9007, should we care?

Much dis­cus­sion spawned from Google Chrome bug #9007. The prob­lem is actu­allty quite sim­ple: Chrome depends on SSE2 instruc­tions and so, when run on proces­sors which do not sup­port such exten­sion, will crash cry­ing ‘Ille­gal instruc­tion’ . This arcane look­ing mes­sage sim­ply means: “Come on my friend, go buy a new computer.”

To under­stand my opin­ion, let me talk a bit about the SSE fam­ily of extensions.

MMX/SSE intro­duced in main­stream com­put­ing the SIMD com­pu­ta­tional model. SIMD means ‘Sin­gle Instruc­tion Mul­ti­ple Data’, so for each instruc­tion the same com­pu­ta­tion is exe­cuted over sev­eral indipen­dent data. This kind of instruc­tions are extremely use­ful in  math­e­mat­ics and mul­ti­me­dia appli­ca­tions. I don’t think SSE can be right­fully called an exten­sion. It’s actu­ally a nec­es­sary fea­ture which was miss­ing in the early Intel desings. SSE2 was intro­duced back in 2001. And it’s cur­rently sup­ported on every mod­ern proces­sor, from Atoms to bleed­ing edge quad core Xeons. Keep in mind that by using SIMD instruc­tions it is pos­si­ble to speedup the code by a fac­tor of 4 or 8. And this could make the dif­fer­ence between a real­time and an offline application.

I really think there is no rea­son for Google devel­op­ers to waste time and resource to sup­port obso­les­cent machines. Good code should be eff­i­cent in terms of proces­sor time, power con­sump­tion and mem­ory allo­ca­tion. To obtain such goals it is often nec­es­sary to exploit new features.

This con­sid­er­a­tion is also the foun­da­tion of the lightspark project design. I’m mak­ing use of every fea­ture a mod­ern plat­form offers, such as heavy mul­ti­thread­ing sup­port, mul­ti­le­dia exten­sions and pro­gram­ma­ble graphic cards. All of this to obtain a soft­ware which is fast and lean on resources, even on lim­ited plat­form such as Mobile Inter­net Deviced and sub-notebooks.