
Talking about CPU architectures: The beginning

Some days ago I was talking about multicore systems and modern CPU technologies with a younger engineering student at Scuola Superiore Sant'Anna. I was quite surprised by how many misconceptions and how much hype surround this topic, even among people who actually study computer engineering. Especially considering that the courses at the University of Pisa are quite good when it comes to low-level hardware stuff (computer engineering was born as a branch of electronic engineering, and the legacy is still quite evident).

So I'll try to write some articles about CPU architectures, from the basics to advanced topics... well... as advanced as my current knowledge goes, which is still better than nothing.

OK, for this essay I'll basically reference the simple processor design that we study in our awesome Logic Circuits course (thanks Prof. Corsini, even if I've not attended a single class of it). The code presented is written in something hopefully similar to Verilog, the hardware description language.

This processor is basically a simple state machine. The internal state is stored in some synchronous registers, usually implemented using so-called edge-triggered flip-flops. There is a clock source which generates a square wave. The clock is used as an input to the registers, which are allowed to take a new value only on the rising edge of the wave. The new value of the registers is usually computed by stateless asynchronous circuits.

[Figure: PC adder — a simple circuit to get the address of the next instruction. The new address is computed during the clock cycle by the asynchronous adder and assigned at the next rising edge of the clock.]
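
As a hedged, self-contained sketch of that figure (the module and signal names are mine, not from the course), it could look like this: the adder is a plain combinational expression, and the register only samples its output on the rising edge of the clock.

module pc_register(input clock);
    reg  [31:0] PC = 0;            // synchronous register: holds the current instruction address
    wire [31:0] next_PC = PC + 4;  // asynchronous adder: its output settles some time after PC changes

    always @(posedge clock)        // only on the rising edge...
        PC <= next_PC;             // ...is the adder's (by now stable) output captured
endmodule

The adder here is exactly one of those stateless asynchronous circuits.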

This means that the output depends only on the input, and the clock signal is not used. Examples of this kind of circuit are the adder, or the more complex ALU. Those circuits take some time to complete their work: the adder, for example, has to propagate the carry information from the least significant bit to the most significant bit. Before the computation is complete the output undergoes several spurious transitions (see the adder animation below), which will be ignored by the following register, as the new value will only be captured on the rising edge of the clock.
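
For the curious, here is a hedged sketch, written by me and not taken from the course material, of a 3-bit ripple-carry adder like the one in the animation below: each bit has to wait for the carry coming from the bit below it, which is exactly the settling delay being discussed.

module ripple_adder3(input [2:0] a, b, output [2:0] sum, output carry_out);
    wire c1, c2;                                           // carries rippling from bit 0 upwards

    // bit 0: a half adder, no incoming carry
    assign sum[0] = a[0] ^ b[0];
    assign c1     = a[0] & b[0];

    // bit 1: a full adder, has to wait for c1 to settle
    assign sum[1] = a[1] ^ b[1] ^ c1;
    assign c2     = (a[1] & b[1]) | (c1 & (a[1] ^ b[1]));

    // bit 2: a full adder, has to wait for c2, which in turn waited for c1
    assign sum[2]    = a[2] ^ b[2] ^ c2;
    assign carry_out = (a[2] & b[2]) | (c2 & (a[2] ^ b[2]));
endmodule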

This allows us to understand two concepts:

  • The faster the clock is, the more new states can be computed, and so the more operations can be done per second. This is what the vendors are talking about when they market their several-GHz CPUs.
  • Why can't the clock be infinitely fast? Because, to guarantee that all the spurious transitions from the asynchronous circuits are discarded, the clock cycle has to be at least a little longer than the longest settle time of those circuits (a rough formula follows below).
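
To put a rough formula on it (the notation here is my own shorthand, not anything from the course): calling t_comb,max the longest settle time of the asynchronous logic between two registers, and adding the small overheads of the flip-flops themselves, the clock period is bounded by

T_clock ≥ t_clk→q + t_comb,max + t_setup

So if, for example, the slowest path through the ALU settles in 0.8 ns and the flip-flop overheads add another 0.2 ns, the clock cannot reasonably run above 1 / 1 ns = 1 GHz.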

For a long time it was easy to pump up clock speeds, from the 4.77 MHz of the first 8088 to the 4 GHz of the last Pentiums, as the progress in CMOS technologies allowed the circuits to get smaller and smaller, which basically also means faster and faster. But after a while this had to stop, even if transistors are still getting smaller. The main problem right now is that power consumption and heat production grow very steeply with the clock frequency, which puts a rather hard upper limit on the reasonable clock speed, if you're not willing to use liquid cooling.
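
To be a little more precise (this is the standard textbook approximation for CMOS, not something specific to this design), the dynamic power of a CMOS circuit is roughly

P_dyn ≈ α · C · V² · f

where α is the switching activity, C the switched capacitance, V the supply voltage and f the clock frequency. At a fixed voltage the growth is only linear in f, but raising the frequency usually requires raising the voltage too, so in practice power grows much faster than linearly with clock speed, and that is where the thermal wall comes from.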


[Figure: this simple 3-bit adder needs some time to add 011 (3) and 001 (1) because it has to propagate the carry information.]

So, back to the CPU internals. One of the registers has a special purpose: it keeps the state of the processor and is usually called STAR (STAtus Register). So the hypothetical pseudo-Verilog for this processor would be:


reg [31:0] MBR, STAR, EAX, PC;

always @(posedge clock)
begin
    case (STAR)
        FETCH0: begin                       // fetch the instruction and advance the program counter
            MBR  <= get_memory(PC);
            PC   <= PC + 4;
            STAR <= FETCH1;
        end
        FETCH1: STAR <= decode_opcode(MBR); // jump to the state of the decoded opcode
        ...
        INC0: begin                         // execute the "increment accumulator" instruction
            EAX  <= alu(OP_ADD, EAX, 1);
            STAR <= FETCH0;
        end
        ...
        HLT: STAR <= HLT;                   // halt: stay in this state forever
    endcase
end

This would be a very simple, incomplete and totally useless fixed-instruction-length CPU, which can only increment the accumulator and halt itself. Let's explain this code a bit. MBR, STAR, EAX and PC are defined as registers, in this case 32 bits wide. The STAR register could be larger or smaller depending on the number of possible internal states. The MBR is the Memory Buffer Register and is used as temporary storage for computations which should not be seen by the user, while EAX and PC are the accumulator and the program counter which we're all used to.

The always block means that the actions have to be taken on the rising edge of the clock. Registers which are assigned in this block are usually implemented using an array of edge-triggered flip-flops. The <= symbol means non-blocking assignment: all those assignments are done in parallel and they only become effective at the following cycle. FETCH0, FETCH1, INC0 and HLT are just symbolic names for some numerical constants, which represent the states. The get_memory, decode_opcode and alu symbols are all asynchronous circuits: they take their inputs and generate the output after some time, hopefully before the next clock cycle. It must be noted that get_memory in the real world cannot be modeled as an asynchronous circuit, as it usually takes several cycles to get a value out of memory.
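
As a small, hedged addition to the listing above (the numeric encodings here are invented, any set of distinct values would do), the symbolic states could be declared as constants:

localparam FETCH0 = 32'd0,    // first fetch step: read the instruction word from memory
           FETCH1 = 32'd1,    // second fetch step: decode the opcode into the next state
           INC0   = 32'd2,    // execution step of the "increment EAX" instruction
           HLT    = 32'd3;    // halt state: the machine loops here forever

With only a handful of states, STAR could of course be declared much narrower than 32 bits, as noted above.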

You can see that in the FETCH0 case we use the PC both as an input and an output, but this is perfectly safe, as all the computation is based on the values at the beginning of the cycle, and the new values are only captured at the end of it. How do we implement the case in hardware? Well, it's quite easy, because we can exploit the essentially free (modulo power consumption) parallelism of electronics to compute all the possible outcomes of each instruction and then use multiplexers to get the desired one. Of course, to avoid races on the inputs we also need multiplexers on the input side of common circuitry, such as the alu.
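
Here is a hedged sketch of that idea, reusing the post's own alu() shorthand and inventing the next_* wire names purely for illustration: every candidate result is computed unconditionally, in parallel, and a multiplexer driven by STAR decides which one actually reaches each register at the clock edge.

// every possible outcome is computed on every cycle, whether it is needed or not
wire [31:0] pc_plus_4  = PC + 4;              // used in the FETCH0 state
wire [31:0] eax_plus_1 = alu(OP_ADD, EAX, 1); // used in the INC0 state

// per-register multiplexers select which outcome survives the next rising edge
wire [31:0] next_PC  = (STAR == FETCH0) ? pc_plus_4  : PC;
wire [31:0] next_EAX = (STAR == INC0)   ? eax_plus_1 : EAX;

always @(posedge clock)
begin
    PC  <= next_PC;
    EAX <= next_EAX;
end

The input-side multiplexers mentioned above would, in the same spirit, select which operands get routed into the shared alu depending on the current STAR value.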

So... this is really the basic, simple way to build a computational machine. Much new and interesting stuff has been added on top of this by now, and I'll spend some words on that in the next essays. The topics that I would like to cover are pipelined, superscalar, vector and multicore CPUs. Maybe I'll also take a brief look at the new trend (or hype): heterogeneous CPUs, such as IBM's Cell. Comments and suggestions are, as always, really welcome.

