Talking about CPU architectures: then came the assembly line


As we’ve seen in the previous essay, in a simple processor design each instruction needs several cycles to complete. Each instruction passes through several steps: first it’s loaded from memory (FETCH0 state), then it’s decoded (FETCH1 state) and the appropriate code path is selected to compute the operation result. In my example I did not include two other essential steps: operand fetching and storing the result to memory. This sequential approach has a big drawback: most of the time the CPU circuits (such as the memory interface or the ALU) are not doing useful work. If on average each instruction takes five steps to complete, then the processor only executes one instruction every five cycles. To solve this problem we can introduce the pipeline.
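The throughput argument above can be sketched in a few lines of Python (my own toy model, not from the essay), assuming every instruction takes the same five steps of work:

```python
STAGES = 5  # steps per instruction, as in the essay

def sequential_cycles(n_instructions):
    # Each instruction must finish all its stages before the next starts.
    return n_instructions * STAGES

def pipelined_cycles(n_instructions):
    # Once the pipeline is full (STAGES cycles), one instruction
    # completes per cycle.
    return STAGES + (n_instructions - 1)

print(sequential_cycles(100))  # 500 cycles
print(pipelined_cycles(100))   # 104 cycles
```

For long instruction streams the pipelined version approaches one instruction per cycle, a five-fold speedup in this model.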

The pipeline concept is borrowed from the industrial assembly line. The basic idea is to decouple the sequential steps of each instruction by inserting buffer registers in between. Each stage of the pipeline is isolated, and on each clock cycle it takes its input from the register upstream and outputs its result into the register downstream. Of course not every instruction needs processing by all stages (for example it may or may not access memory), so each stage has to have some kind of No Operation mode that passes the input register through to the output unmodified.
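The buffer-register idea can be simulated in Python (a minimal sketch with names of my own choosing): on every clock tick each stage reads the register upstream and writes the register downstream, with `None` playing the role of the No Operation pass-through:

```python
def clock_tick(regs, stages):
    """regs[i] feeds stages[i]; results land in regs[i+1].
    All stages fire 'simultaneously', so compute from the old values."""
    old = list(regs)
    for i, stage in enumerate(stages):
        regs[i + 1] = stage(old[i]) if old[i] is not None else None

# Two trivial stages: decode (split the text) and execute (add operands).
stages = [lambda instr: instr.split(),
          lambda ops: int(ops[1]) + int(ops[2])]
regs = ["add 1 2", None, None]  # fetched instruction, decoded, result

clock_tick(regs, stages)  # decode happens
regs[0] = None            # the next fetch delivers a bubble
clock_tick(regs, stages)  # execute happens
print(regs[2])            # 3
```

Both instructions-in-flight and bubbles simply travel down the register chain, one stage per cycle.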

A simple pseudo-Verilog description of a pipelined processor follows. We are assuming that the instruction and data memory are physically separate: a Harvard-like architecture. This is not true in current mainstream architectures, as both instructions and data are stored in RAM, but it may be considered true to a large extent if there is a fast instruction cache. Moreover, it has to be noted that although the access_memory module is referenced both in the operand fetching and output storing stages, it can only correctly serve one request at a time, so some kind of arbitration has to be used. We can arbitrarily decide that reads are prioritized over writes, but in case of a collision the operation will certainly be incorrect, though safe for the electronics. The register_file module is just a multiplexer for the registers: it outputs the values of the registers at the provided indexes. The module is used both in the operand fetching and writeback stages, but that’s safe because it allows two register reads and one write at the same time (three-way access).
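The read-over-write arbitration policy can be sketched like this (names and tuple layout are my own, chosen for illustration):

```python
READ, WRITE = "read", "write"

def arbitrate(requests):
    """Given the memory requests issued in one cycle, pick the one served.
    Reads win; on a collision the losing write is silently dropped,
    which is electrically safe but architecturally wrong."""
    reads = [r for r in requests if r[0] == READ]
    if reads:
        return reads[0]
    return requests[0] if requests else None

# A write and a read collide in the same cycle: only the read is served.
print(arbitrate([(WRITE, 0x10, 42), (READ, 0x20, None)]))
```

This is exactly why the software (or an interlock) must ensure a load and a store never reach the memory port in the same cycle.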

This toy processor is inspired by the MIPS architecture and, as you can see, it uses fixed-length instructions. But I’ve also tried to design something more similar to an x86 processor. The instruction set (which I’m too lazy to define extensively) could allow both register-register and register-memory operations, and output to either memory or registers. No indirect memory access is allowed, sorry. As the instructions are four bytes long, probably only a 16 or 24 bit flat address space could be accessed.


always @(posedge clock)
begin
//
//Instruction fetch stage
FETCHED_INSTRUCTION <= get_instruction_memory(PC); PC <= PC+4;
//
//Instruction decoding stage
{OPCODE_0, OPERAND1_TYPE, REG1_INDEX, REG2_INDEX, MEM_ADDRESS_0, WRITE_TO_MEMORY_0, DEST_REG_0} <= decode_instruction(FETCHED_INSTRUCTION);
//
//Operand fetching stage, some registers will be forwarded to the next stages
OPERAND1 <= (OPERAND1_TYPE==REG) ?
read_register_file(REG1_INDEX, REG2_INDEX)[31:0] :
access_memory(MEM_ADDRESS_0,READ,UNSPECIFIED);
OPERAND2 <= read_register_file(REG1_INDEX,REG2_INDEX)[63:32];
MEM_ADDRESS_1 <= MEM_ADDRESS_0;
WRITE_TO_MEMORY_1 <= WRITE_TO_MEMORY_0;
DEST_REG_1 <= DEST_REG_0;
OPCODE_1 <= OPCODE_0;
//
//Execution stage
RESULT <= alu(OPCODE_1,OPERAND1,OPERAND2);
MEM_ADDRESS_2 <= MEM_ADDRESS_1;
WRITE_TO_MEMORY_2 <= WRITE_TO_MEMORY_1;
DEST_REG_2 <= DEST_REG_1;
//
//Writeback stage
(WRITE_TO_MEMORY_2) ? access_memory(MEM_ADDRESS_2,WRITE,RESULT) : write_register_file(DEST_REG_2, RESULT);
end

Several considerations can be made on this design. First of all, extreme care is needed from the compiler (or from users, if they are crazy enough to write assembly manually), because there are several hazards: not only, as said previously, on concurrent read/write memory access, but also on registers. As the result of an instruction is written back to the register file only at the end of the pipeline, it’s not legal to access a result before five more instructions have been executed. To guarantee memory access validity it’s enough to avoid issuing a memory write two instructions after a memory load. If no useful instruction can be issued during this forced wait, a NOP should be inserted.
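The register-hazard rule above can be applied mechanically. Here is a sketch of what such a compiler pass might look like (the instruction encoding and the `DISTANCE` constant are my assumptions, matching the five-instruction rule stated above):

```python
DISTANCE = 5          # instructions that must execute before a result is readable
NOP = ("nop", None, None)

def insert_nops(program):
    """program: list of (op, dest_reg, src_regs) tuples.
    Returns the program padded with NOPs so no instruction reads a
    register before its producer's writeback has completed."""
    out = []
    last_write = {}   # reg name -> index in `out` of the writing instruction
    for op, dest, srcs in program:
        for s in srcs or ():
            if s in last_write:
                needed = last_write[s] + DISTANCE + 1 - len(out)
                out.extend([NOP] * max(0, needed))
        out.append((op, dest, srcs))
        if dest is not None:
            last_write[dest] = len(out) - 1
    return out

prog = [("load", "r1", None), ("add", "r2", ("r1", "r1"))]
padded = insert_nops(prog)
print(sum(1 for i in padded if i[0] == "nop"))  # 5 NOPs inserted
```

A real compiler would of course try to fill those five slots with independent useful instructions instead of NOPs.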

Real-world architectures employ interlocking techniques that take care of stopping (stalling) the pipeline when a hazard could occur. I’ll describe stalling in more detail in another essay. Anyway, if a pipeline stage needs more than one cycle to complete, such as the ALU on a complicated computation (for example integer division), it will stall the previous stages of the pipeline and insert so-called bubbles into the later stages. Bubbles are basically safe No Operation input states; the NOP opcode in any architecture is just an explicit pipeline bubble. Even on modern interlocked architectures correct instruction scheduling is extremely useful to gain performance, as pipeline bubbles actually waste execution time.
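The effect of a multi-cycle stage on the downstream pipeline can be sketched as follows (a toy model of my own; `latency` maps opcodes to their execution-stage cycle counts):

```python
def run(program, latency):
    """program: list of opcodes. Returns what the writeback stage sees
    each cycle: real opcodes interleaved with the bubbles that a
    multi-cycle execution pushes downstream."""
    retired = []
    for op in program:
        cycles = latency.get(op, 1)
        # While the execute stage is busy, the stages after it
        # receive bubbles; the result retires on the last cycle.
        retired.extend(["bubble"] * (cycles - 1))
        retired.append(op)
    return retired

print(run(["add", "div", "sub"], {"div": 4}))
# → ['add', 'bubble', 'bubble', 'bubble', 'div', 'sub']
```

Each bubble is a retirement slot doing nothing, which is precisely the execution time that good instruction scheduling tries to win back.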

That’s all for now. In the previous post I forgot to mention the ever failing, but very nice, Very Long Instruction Word (VLIW) approach. I’m of course going to talk about that sometime.

