Archive for October, 2009

Allocating aligned memory

Just a quick note that may be use­ful to some­one else. As you may know SSE2 intro­duced a new instruc­tion: MOVDQA (MOVe Dou­ble Quad­word Aligned). This is used to move 128 bit (16 bytes) of data from/to memory/xmm reg­is­ters. This instruc­tion only works if the data is aligned the the 16 byte bound­ary. There is also another instruc­tion for the unaligned case, but the aligned ver­sion is way faster. So let’s sum­ma­rize some tech­niques to get an aligned mem­ory area

  • For local, sta­tic and mem­ber vari­ables you can append __attribute__ (( aligned (16 ) ) to the type def­i­n­i­tion. Example:
  • struct A { int val; } __attribute__ ((aligned ( 16 ) );

  • For dynam­i­cally allo­cated mem­ory the usual mal­loc is not enough, but there is a posix_memalign which has the seman­tics that we need. It is defined as:
  • int posix_memalign(void **memptr, size_t alignment, size_t size);

So we have to pass a pointer to the pointer that will receive the newly allo­cated mem­ory, the required align­ment (which has to be a power of two) and the allo­ca­tion size. Mem­ory allo­cated this way can (at least on the glibc imple­men­ta­tion) be freed using the usual free func­tion.

, , , , , ,

No Comments

Allocating aligned memory

Just a quick note that may be use­ful to some­one else. As you may know SSE2 intro­duced a new instruc­tion: MOVDQA (MOVe Dou­ble Quad­word Aligned). This is used to move 128 bit (16 bytes) of data from/to memory/xmm reg­is­ters. This instruc­tion only works if the data is aligned the the 16 byte bound­ary. There is also another instruc­tion for the unaligned case, but the aligned ver­sion is way faster. So let’s sum­ma­rize some tech­niques to get an aligned mem­ory area

  • For local, sta­tic and mem­ber vari­ables you can append __attribute__ (( aligned (16 ) ) to the type def­i­n­i­tion. Example:
  • struct A { int val; } __attribute__ ((aligned ( 16 ) );

  • For dynam­i­cally allo­cated mem­ory the usual mal­loc is not enough, but there is a posix_memalign which has the seman­tics that we need. It is defined as:
  • int posix_memalign(void **memptr, size_t alignment, size_t size);

So we have to pass a pointer to the pointer that will receive the newly allo­cated mem­ory, the required align­ment (which has to be a power of two) and the allo­ca­tion size. Mem­ory allo­cated this way can (at least on the glibc imple­men­ta­tion) be freed using the usual free func­tion.

, , , , , ,

No Comments

Talking about CPU architectures: Then it came assembly chain

As we’ve seen in the pre­vi­ous essay in a sim­ple proces­sor design each instruc­tion needs sev­eral cycles to be com­pleted. Each instruc­tion passes sev­eral steps: first it’s loaded from mem­ory (FETCH0 state), then it’s decoded (FETCH1 state) and the appro­pri­ate code path is selected to com­pute the oper­a­tion results. In the my exam­ple I’ve not included other two essen­tial steps: operands fetch­ing and results stor­ing on mem­ory. This sequen­tial approach has a big draw­back, as for most of the time the CPU cir­cuits (such as the mem­ory inter­face or the ALU) are not doing use­ful work. If on aver­age each instruc­tion takes five steps to com­plete, then the proces­sor is only exe­cut­ing one instruc­tion every five cycles. To solve this prob­lem we can intro­duce the pipeline.

The pipeline con­cept is bor­rowed from the indus­trial assem­bly line. The basic idea is to decou­ple the sequen­tial steps of each instruc­tion by insert­ing buffer reg­is­ters in between. Each stage of the pipeline is iso­lated, and on each clock cycle it can take the input from the reg­is­ter upstream and out­put the results into the reg­is­ter down­stream. Of course not every instruc­tion needs pro­cess­ing by all stages (for exam­ple it may or may not access mem­ory) and so each stage has to have some kind of No Oper­a­tion mode to pass trough the unmod­i­fied input reg­is­ter to the output.

A sim­ple pseudo-verilog descrip­tion of a pipelined proces­sor fol­lows. We are assum­ing that the instruc­tion and data mem­ory are phys­i­cally sep­a­rate: an Havard like archi­tec­ture. This is not true in cur­rent main­stream archi­tec­tures, as both instruc­tion and data are stored in RAM, but may be con­sid­ered true to a large extend if there is a fast instruc­tion cache. More­over it has to be noted that although the access_memory mod­ule is ref­er­enced both in the input fetch­ing and out­put stor­ing stages, it can only cor­rectly serve one request at a time, and so some kind of arbi­tra­tion has to be used. We can arbi­trar­ily decide that reads are pri­or­i­tized over write, but in case of a col­li­sion the oper­a­tion will be for sure not cor­rect, though safe for the elec­tron­ics. The register_file mod­ule is just a mul­ti­plexer for the reg­is­ters, it will out­put the value of the reg­is­ters for the pro­vided indexes. The mod­ule it’s used both in the operands fetch­ing and write­back stages, but it’s safe because it allows two reg­is­ter reads and one write and the same time (three way access).

This toy proces­sor is inspired from MIPS archi­tec­ture and as you can see it uses fixed length instruc­tions. But I’ve also tried to design some­thing more sim­i­lar to an x86 proces­sor. The instruc­tion set (which I’m too lazy to define exten­sively) could allow both register-register and register-memory oper­a­tion, and out­put to either mem­ory or reg­is­ters. No indi­rect mem­ory access is allowed, sorry. As the instruc­tions are four byte long prob­a­bly only a 16 or 24 bit flat address space could be accessed.


always @(posedge clock)
begin
//
//Instructrion fetch stage
FETCHED_INSTRUCTION <= get_instruction_memory(PC); PC <= PC+4;
//
//Instruction decoding stage
{OPCODE_0, OPERAND1_TYPE, REG1_INDEX, REG2_INDEX, MEM_ADDRESS_0, WRITE_TO_MEMORY_0, DEST_REG_0} <= decode_instruction(FETCHED_INSTRUCTION);
//
//Operands fecthing stage, some register will be forwarded to the next stages
OPERAND1 <= (OPERAND1_TYPE==REG) ?
read_register_file(REG1_INDEX, REG2_INDEX)[31:0] :
access_memory(MEM_ADDRESS_0,READ,UNSPECIFIED);
OPERAND2 <= register_file(REG1_INDEX,REG2_INDEX)[63:32];
MEM_ADDRESS_1 <= MEM_ADDRESS_0;
WRITE_TO_MEMORY_1 <= WRITE_TO_MEMORY_0;
DEST_REG_1 <= DEST_REG_0;
OPCODE_1 <= OPCODE_0;
//
//Execution stage
RESULT <= alu(OPCODE_1,OPERAND1,OPERAND2);
MEM_ADDRESS_2 <= MEM_ADDRESS_1;
WRITE_TO_MEMORY_2 <= WRITE_TO_MEMORY_1;
DEST_REG_2 <= DEST_REG_1;
//
//Writeback stage
(WRITE_TO_MEMORY_2) ? access_memory(MEM_ADDRESS_2,WRITE,RESULT) : write_register_file(DEST_REG_2, RESULT);
end

Sev­eral con­sid­er­a­tions can be made on this design. First of all extreme care is needed from the com­piler (or the user, if it’s crazy enough to write assem­bly man­u­ally), because there are sev­eral haz­ards, not only as said pre­vi­ously on con­cur­rent read/write mem­ory access, but also on reg­is­ters. As the result of an instruc­tion is writ­ten back to the reg­is­ter file only at the end of the pipeline it’s not legal to access a result before five more instruc­tion has been exe­cuted. To guar­an­tee mem­ory access valid­ity it’s enough to avoid issu­ing a mem­ory write two instruc­tion after a mem­ory load. If no use­ful instruc­tion can be issued dur­ing this forced wait­ing, a NOP should be inserted.

Real world archi­tec­ture employs inter­lock­ing tech­niques which takes care of stop­ping (stalling) the pipeline when an haz­ard could occur. I’ll describe stalling in more detail in another essay. Any­way if a pipeline stage, such us the ALU for com­pli­cated com­pu­ta­tion (for exam­ple inte­ger divi­sion), needs more than a cycle to com­plete it will stall the provi­ous stages of the pipeline, and insert so called bub­bles into the later stage. Bub­bles are basi­cally safe No Oper­a­tion input states, the NOP opcode in any archi­tec­ture is just an explicit pipeline bub­ble. Even in mod­ern inter­locked archi­tec­ture cor­rect instruc­tion sched­ul­ing is extremely use­ful to gain per­for­mance as pipeline bub­bles actu­ally waste exe­cu­tion time.

That’s all for now, in the pre­vi­ous post I for­got to men­tion the ever fail­ing, but very nice Very Long Instruc­tion Word (VLIW) approach. I’m of course going to talk about that to sometime.

, , , , , , , , ,

No Comments

Talking about CPU architectures: Then it came assembly chain

As we’ve seen in the pre­vi­ous essay in a sim­ple proces­sor design each instruc­tion needs sev­eral cycles to be com­pleted. Each instruc­tion passes sev­eral steps: first it’s loaded from mem­ory (FETCH0 state), then it’s decoded (FETCH1 state) and the appro­pri­ate code path is selected to com­pute the oper­a­tion results. In the my exam­ple I’ve not included other two essen­tial steps: operands fetch­ing and results stor­ing on mem­ory. This sequen­tial approach has a big draw­back, as for most of the time the CPU cir­cuits (such as the mem­ory inter­face or the ALU) are not doing use­ful work. If on aver­age each instruc­tion takes five steps to com­plete, then the proces­sor is only exe­cut­ing one instruc­tion every five cycles. To solve this prob­lem we can intro­duce the pipeline.

The pipeline con­cept is bor­rowed from the indus­trial assem­bly line. The basic idea is to decou­ple the sequen­tial steps of each instruc­tion by insert­ing buffer reg­is­ters in between. Each stage of the pipeline is iso­lated, and on each clock cycle it can take the input from the reg­is­ter upstream and out­put the results into the reg­is­ter down­stream. Of course not every instruc­tion needs pro­cess­ing by all stages (for exam­ple it may or may not access mem­ory) and so each stage has to have some kind of No Oper­a­tion mode to pass trough the unmod­i­fied input reg­is­ter to the output.

A sim­ple pseudo-verilog descrip­tion of a pipelined proces­sor fol­lows. We are assum­ing that the instruc­tion and data mem­ory are phys­i­cally sep­a­rate: an Havard like archi­tec­ture. This is not true in cur­rent main­stream archi­tec­tures, as both instruc­tion and data are stored in RAM, but may be con­sid­ered true to a large extend if there is a fast instruc­tion cache. More­over it has to be noted that although the access_memory mod­ule is ref­er­enced both in the input fetch­ing and out­put stor­ing stages, it can only cor­rectly serve one request at a time, and so some kind of arbi­tra­tion has to be used. We can arbi­trar­ily decide that reads are pri­or­i­tized over write, but in case of a col­li­sion the oper­a­tion will be for sure not cor­rect, though safe for the elec­tron­ics. The register_file mod­ule is just a mul­ti­plexer for the reg­is­ters, it will out­put the value of the reg­is­ters for the pro­vided indexes. The mod­ule it’s used both in the operands fetch­ing and write­back stages, but it’s safe because it allows two reg­is­ter reads and one write and the same time (three way access).

This toy proces­sor is inspired from MIPS archi­tec­ture and as you can see it uses fixed length instruc­tions. But I’ve also tried to design some­thing more sim­i­lar to an x86 proces­sor. The instruc­tion set (which I’m too lazy to define exten­sively) could allow both register-register and register-memory oper­a­tion, and out­put to either mem­ory or reg­is­ters. No indi­rect mem­ory access is allowed, sorry. As the instruc­tions are four byte long prob­a­bly only a 16 or 24 bit flat address space could be accessed.


always @(posedge clock)
begin
//
//Instructrion fetch stage
FETCHED_INSTRUCTION <= get_instruction_memory(PC); PC <= PC+4;
//
//Instruction decoding stage
{OPCODE_0, OPERAND1_TYPE, REG1_INDEX, REG2_INDEX, MEM_ADDRESS_0, WRITE_TO_MEMORY_0, DEST_REG_0} <= decode_instruction(FETCHED_INSTRUCTION);
//
//Operands fecthing stage, some register will be forwarded to the next stages
OPERAND1 <= (OPERAND1_TYPE==REG) ?
read_register_file(REG1_INDEX, REG2_INDEX)[31:0] :
access_memory(MEM_ADDRESS_0,READ,UNSPECIFIED);
OPERAND2 <= register_file(REG1_INDEX,REG2_INDEX)[63:32];
MEM_ADDRESS_1 <= MEM_ADDRESS_0;
WRITE_TO_MEMORY_1 <= WRITE_TO_MEMORY_0;
DEST_REG_1 <= DEST_REG_0;
OPCODE_1 <= OPCODE_0;
//
//Execution stage
RESULT <= alu(OPCODE_1,OPERAND1,OPERAND2);
MEM_ADDRESS_2 <= MEM_ADDRESS_1;
WRITE_TO_MEMORY_2 <= WRITE_TO_MEMORY_1;
DEST_REG_2 <= DEST_REG_1;
//
//Writeback stage
(WRITE_TO_MEMORY_2) ? access_memory(MEM_ADDRESS_2,WRITE,RESULT) : write_register_file(DEST_REG_2, RESULT);
end

Sev­eral con­sid­er­a­tions can be made on this design. First of all extreme care is needed from the com­piler (or the user, if it’s crazy enough to write assem­bly man­u­ally), because there are sev­eral haz­ards, not only as said pre­vi­ously on con­cur­rent read/write mem­ory access, but also on reg­is­ters. As the result of an instruc­tion is writ­ten back to the reg­is­ter file only at the end of the pipeline it’s not legal to access a result before five more instruc­tion has been exe­cuted. To guar­an­tee mem­ory access valid­ity it’s enough to avoid issu­ing a mem­ory write two instruc­tion after a mem­ory load. If no use­ful instruc­tion can be issued dur­ing this forced wait­ing, a NOP should be inserted.

Real world archi­tec­ture employs inter­lock­ing tech­niques which takes care of stop­ping (stalling) the pipeline when an haz­ard could occur. I’ll describe stalling in more detail in another essay. Any­way if a pipeline stage, such us the ALU for com­pli­cated com­pu­ta­tion (for exam­ple inte­ger divi­sion), needs more than a cycle to com­plete it will stall the provi­ous stages of the pipeline, and insert so called bub­bles into the later stage. Bub­bles are basi­cally safe No Oper­a­tion input states, the NOP opcode in any archi­tec­ture is just an explicit pipeline bub­ble. Even in mod­ern inter­locked archi­tec­ture cor­rect instruc­tion sched­ul­ing is extremely use­ful to gain per­for­mance as pipeline bub­bles actu­ally waste exe­cu­tion time.

That’s all for now, in the pre­vi­ous post I for­got to men­tion the ever fail­ing, but very nice Very Long Instruc­tion Word (VLIW) approach. I’m of course going to talk about that to sometime.

, , , , , , , , ,

No Comments

Talking about CPU architectures: The beginning

Some days ago I was talk­ing about mul­ti­core sys­tems and mod­ern CPU tech­nolo­gies with a younger engi­neer­ing stu­dent at Scuola Supe­ri­ore Sant’Anna. I was quite sur­prised by how much mis­con­cep­tions and hype sur­rounds this topic, even for peo­ple that actu­ally study com­puter engi­neer­ing. Espe­cially con­sid­er­ing that the courses of the Uni­ver­sity of Pisa are quite good when it comes to low level hard­ware stuff (com­puter engi­neer­ing was born by branch­ing elec­tronic engi­neer­ing, and the legacy is still quite evident).

So I’ll try to write some arti­cles about CPU archi­tec­tures, from the basic to advanced top­ics... well... as advanced as my cur­rent knowl­edge goes, still bet­ter than nothing.

Ok, for this essay I’ll basi­cally ref­er­ence the sim­ple proces­sor design that we study in our awe­some Logic Cir­cuits course (thanks Prof. Corsini, even if I’ve not seen a sin­gle class of it). The code pre­sented is writ­ten is some­thing hope­fully sim­i­lar to Ver­ilog, the hard­ware descrip­tion language.

This proces­sor is basi­cally a sim­ple state machine. The inter­nal state is stored in some syn­chro­nous reg­is­ter, usu­ally imple­mented using so called edge-triggered flip-flops. There is a clock source which gen­er­ates a square wave. The clock is used as an input to reg­is­ters, which are allowed to get a new value only on the rais­ing edge of the wave. The new value of the reg­is­ters is usu­ally com­puted by state­less asyn­chro­nous circuits.

PC adder

A sim­ple cir­cuit to get the address of the next instruc­tion: the new address is com­puted dur­ing the clock cycle by the asyn­chro­nous adder and assigned at the next ris­ing edge of the clock

This means that the out­put depends only on the input, and the clock sig­nal is not used. An exam­ple of this kind of cir­cuits could be the adder, or the more com­plex ALU. Those cir­cuits takes some time to com­plete their work, the adder, for exam­ple, has to prop­a­gate the carry infor­ma­tion from the least sig­nif­i­cant bit to the most sig­nif­i­cant bit. Before the com­pu­ta­tion is com­plete the out­put state under­goes sev­eral spu­ri­ous tran­si­tions (add gif) which will be ignored by the fol­low­ing reg­is­ter, as the new value will only be cap­tured on the rais­ing edge of the clock.

This allows us to under­stand two concepts:

  • The faster the clock is, the more new states can be com­puted, and so the more oper­a­tions can be done. So this is what the ven­dors are talk­ing about when they mar­ket they sev­eral GhZ CPUs.
  • Why can’t the clock be infi­nitely fast? Because to guar­an­tee that all the spu­ri­ous tran­si­tions from asyn­chro­nous cir­cuits will be dis­carted the clock cycle time has to be at least a lit­tle longer than the longest set­tle time of those circuits.

For a long time dur­ing the pre­vi­ous years it was easy to pump up the clock speeds, from the 4 Mhz of the first 8088 to the 4 Ghz of the last Pen­tiums, as the progress in CMOS techlo­gies allowed to make the cir­cuits smaller and smaller, which also means basi­cally faster and faster. But after a while this had to stop, even if tran­sis­tors are still get­ting smaller. The main prob­lem right now is that power con­sump­tion and heat pro­duc­tion go with the square of the clock fre­quency, which poses a very heavy upper limit to the rea­son­able clock speed, if you’re not will­ing to use liq­uid refrigeration.

This simple 3 bit adder needs some time to add 011 (3) and 001 (1) beacuase it has to propagate carry information

This sim­ple 3 bit adder needs some time to add 011 (3) and 001 (1) because it has to prop­a­gate carry information

So, back to the CPU inter­nals. One of the reg­is­ters has a spe­cial pur­pose: it keeps the state of the proces­sor and it’s usu­ally called STAR (STA­tus Reg­is­ter). So the ipotet­i­cal pseudo-verilog for this proces­sor would be:


reg [31:0] MBR, STAR, EAX, PC;
always @(posedge clock)
begin
case(STAR)
FETCH0: MBR <= get_memory(PC); STAR <= FETCH1; PC <= PC+4;
FETCH1: STAR <= decode_opcode(MBR);
...
INC0: EAX <= alu(OP_ADD,EAX,1) STAR <= FETCH0;
...
HLT: STAR<=HLT;
end

This would be a very sim­ple, incom­plete and totally use­less fixed instruc­tion length CPU, which can only incre­ment the accu­mu­la­tor and halt itself. Let’s explain this code a bit. MBR, STAR, EAX and PC are defined as reg­is­ters, in this case they are 32 bits wide. The STAR reg­is­ter could be larger or smaller depend­ing on the num­ber of pos­si­ble inter­nal state. The MBR is the Mas­ter Buffer Reg­is­ter and it’s used as a tem­po­rary stor­age for com­pu­ta­tions which should not be seen by the user, while EAX and PC are the accu­mu­la­tor and the pro­gram counter which we’re all used to. The \emph{always} syn­tax means that the actions has to be taken on the rais­ing edge of the clock. Reg­is­ters which are assigned in this block are usu­ally imple­mented using an array of edge trig­gered flip-flops. The <= sym­bols means non block­ing assign­ment. All those assign­ments are done in par­al­lel and they will only be effec­tive at the fol­low­ing cycle. The FETCH,INC and HLT are just sym­bolic ver­sion for some numer­i­cal con­stants, which rep­re­sent the state. The get_memory, decode_opcode and alu sym­bols are all asyn­chro­nous cir­cuits, they will take their inputs and gen­er­ate the out­put after some time, hope­fully before the next clock cycle. It must be noted that the get_memory in the real world can­not be mod­eled as an asyn­chro­nous cir­cuit, as it usu­ally takes sev­eral cycle to get some value out of memory.

You can see that in the FETCH0 case we use the PC both as an input and out­put, but this is per­fectly safe as all the com­pu­ta­tion will be based on the val­ues at the begin­ning of the cycle, and the new val­ues will be gath­ered at the end of it. How to imple­ment the case in hard­ware? Well it’s quite easy because we can exploit the for free (mod­ulo power con­sump­tion) par­al­lelism of elec­tron­ics to com­pute all the pos­si­ble out­come from each instruc­tion and then use mul­ti­plexer to get the desired one. Of course to avoid races in the inputs we also need mul­ti­plexer on the input side of com­mon cir­cuitry, such as the alu.

So... this is really the basic sim­ple way to build a com­pu­ta­tional machine. Much new inter­est­ing stuff has been added to this now, and I’ll spend some words on that in the next essays. The top­ics that I would like to cover are pipelined, super­scalar, vec­to­r­ial and mul­ti­core CPUs. Maybe I’ll also take a brief look at the new trend (or hype): het­eroge­nous CPUs, such as IBMs Cell. Com­ment and sug­ges­tion are, as always, really welcome.

, , , , ,

No Comments

Talking about CPU architectures: The beginning

Some days ago I was talk­ing about mul­ti­core sys­tems and mod­ern CPU tech­nolo­gies with a younger engi­neer­ing stu­dent at Scuola Supe­ri­ore Sant’Anna. I was quite sur­prised by how much mis­con­cep­tions and hype sur­rounds this topic, even for peo­ple that actu­ally study com­puter engi­neer­ing. Espe­cially con­sid­er­ing that the courses of the Uni­ver­sity of Pisa are quite good when it comes to low level hard­ware stuff (com­puter engi­neer­ing was born by branch­ing elec­tronic engi­neer­ing, and the legacy is still quite evident).

So I’ll try to write some arti­cles about CPU archi­tec­tures, from the basic to advanced top­ics... well... as advanced as my cur­rent knowl­edge goes, still bet­ter than nothing.

Ok, for this essay I’ll basi­cally ref­er­ence the sim­ple proces­sor design that we study in our awe­some Logic Cir­cuits course (thanks Prof. Corsini, even if I’ve not seen a sin­gle class of it). The code pre­sented is writ­ten is some­thing hope­fully sim­i­lar to Ver­ilog, the hard­ware descrip­tion language.

This proces­sor is basi­cally a sim­ple state machine. The inter­nal state is stored in some syn­chro­nous reg­is­ter, usu­ally imple­mented using so called edge-triggered flip-flops. There is a clock source which gen­er­ates a square wave. The clock is used as an input to reg­is­ters, which are allowed to get a new value only on the rais­ing edge of the wave. The new value of the reg­is­ters is usu­ally com­puted by state­less asyn­chro­nous circuits.

PC adder

A sim­ple cir­cuit to get the address of the next instruc­tion: the new address is com­puted dur­ing the clock cycle by the asyn­chro­nous adder and assigned at the next ris­ing edge of the clock

This means that the out­put depends only on the input, and the clock sig­nal is not used. An exam­ple of this kind of cir­cuits could be the adder, or the more com­plex ALU. Those cir­cuits takes some time to com­plete their work, the adder, for exam­ple, has to prop­a­gate the carry infor­ma­tion from the least sig­nif­i­cant bit to the most sig­nif­i­cant bit. Before the com­pu­ta­tion is com­plete the out­put state under­goes sev­eral spu­ri­ous tran­si­tions (add gif) which will be ignored by the fol­low­ing reg­is­ter, as the new value will only be cap­tured on the rais­ing edge of the clock.

This allows us to under­stand two concepts:

  • The faster the clock is, the more new states can be com­puted, and so the more oper­a­tions can be done. So this is what the ven­dors are talk­ing about when they mar­ket they sev­eral GhZ CPUs.
  • Why can’t the clock be infi­nitely fast? Because to guar­an­tee that all the spu­ri­ous tran­si­tions from asyn­chro­nous cir­cuits will be dis­carted the clock cycle time has to be at least a lit­tle longer than the longest set­tle time of those circuits.

For a long time dur­ing the pre­vi­ous years it was easy to pump up the clock speeds, from the 4 Mhz of the first 8088 to the 4 Ghz of the last Pen­tiums, as the progress in CMOS techlo­gies allowed to make the cir­cuits smaller and smaller, which also means basi­cally faster and faster. But after a while this had to stop, even if tran­sis­tors are still get­ting smaller. The main prob­lem right now is that power con­sump­tion and heat pro­duc­tion go with the square of the clock fre­quency, which poses a very heavy upper limit to the rea­son­able clock speed, if you’re not will­ing to use liq­uid refrigeration.

This simple 3 bit adder needs some time to add 011 (3) and 001 (1) beacuase it has to propagate carry information

This sim­ple 3 bit adder needs some time to add 011 (3) and 001 (1) because it has to prop­a­gate carry information

So, back to the CPU inter­nals. One of the reg­is­ters has a spe­cial pur­pose: it keeps the state of the proces­sor and it’s usu­ally called STAR (STA­tus Reg­is­ter). So the ipotet­i­cal pseudo-verilog for this proces­sor would be:


reg [31:0] MBR, STAR, EAX, PC;
always @(posedge clock)
begin
case(STAR)
FETCH0: MBR <= get_memory(PC); STAR <= FETCH1; PC <= PC+4;
FETCH1: STAR <= decode_opcode(MBR);
...
INC0: EAX <= alu(OP_ADD,EAX,1) STAR <= FETCH0;
...
HLT: STAR<=HLT;
end

This would be a very sim­ple, incom­plete and totally use­less fixed instruc­tion length CPU, which can only incre­ment the accu­mu­la­tor and halt itself. Let’s explain this code a bit. MBR, STAR, EAX and PC are defined as reg­is­ters, in this case they are 32 bits wide. The STAR reg­is­ter could be larger or smaller depend­ing on the num­ber of pos­si­ble inter­nal state. The MBR is the Mas­ter Buffer Reg­is­ter and it’s used as a tem­po­rary stor­age for com­pu­ta­tions which should not be seen by the user, while EAX and PC are the accu­mu­la­tor and the pro­gram counter which we’re all used to. The \emph{always} syn­tax means that the actions has to be taken on the rais­ing edge of the clock. Reg­is­ters which are assigned in this block are usu­ally imple­mented using an array of edge trig­gered flip-flops. The <= sym­bols means non block­ing assign­ment. All those assign­ments are done in par­al­lel and they will only be effec­tive at the fol­low­ing cycle. The FETCH,INC and HLT are just sym­bolic ver­sion for some numer­i­cal con­stants, which rep­re­sent the state. The get_memory, decode_opcode and alu sym­bols are all asyn­chro­nous cir­cuits, they will take their inputs and gen­er­ate the out­put after some time, hope­fully before the next clock cycle. It must be noted that the get_memory in the real world can­not be mod­eled as an asyn­chro­nous cir­cuit, as it usu­ally takes sev­eral cycle to get some value out of memory.

You can see that in the FETCH0 case we use the PC both as an input and out­put, but this is per­fectly safe as all the com­pu­ta­tion will be based on the val­ues at the begin­ning of the cycle, and the new val­ues will be gath­ered at the end of it. How to imple­ment the case in hard­ware? Well it’s quite easy because we can exploit the for free (mod­ulo power con­sump­tion) par­al­lelism of elec­tron­ics to com­pute all the pos­si­ble out­come from each instruc­tion and then use mul­ti­plexer to get the desired one. Of course to avoid races in the inputs we also need mul­ti­plexer on the input side of com­mon cir­cuitry, such as the alu.

So... this is really the basic sim­ple way to build a com­pu­ta­tional machine. Much new inter­est­ing stuff has been added to this now, and I’ll spend some words on that in the next essays. The top­ics that I would like to cover are pipelined, super­scalar, vec­to­r­ial and mul­ti­core CPUs. Maybe I’ll also take a brief look at the new trend (or hype): het­eroge­nous CPUs, such as IBMs Cell. Com­ment and sug­ges­tion are, as always, really welcome.

, , , , ,

No Comments