As we’ve seen in the previous essay in a simple processor design each instruction needs several cycles to be completed. Each instruction passes several steps: first it’s loaded from memory (FETCH0 state), then it’s decoded (FETCH1 state) and the appropriate code path is selected to compute the operation results. In the my example I’ve not included other two essential steps: operands fetching and results storing on memory. This sequential approach has a big drawback, as for most of the time the CPU circuits (such as the memory interface or the ALU) are not doing useful work. If on average each instruction takes five steps to complete, then the processor is only executing one instruction every five cycles. To solve this problem we can introduce the pipeline.
The pipeline concept is borrowed from the industrial assembly line. The basic idea is to decouple the sequential steps of each instruction by inserting buffer registers in between. Each stage of the pipeline is isolated, and on each clock cycle it can take the input from the register upstream and output the results into the register downstream. Of course not every instruction needs processing by all stages (for example it may or may not access memory) and so each stage has to have some kind of No Operation mode to pass trough the unmodified input register to the output.
A simple pseudo-verilog description of a pipelined processor follows. We are assuming that the instruction and data memory are physically separate: an Havard like architecture. This is not true in current mainstream architectures, as both instruction and data are stored in RAM, but may be considered true to a large extend if there is a fast instruction cache. Moreover it has to be noted that although the access_memory module is referenced both in the input fetching and output storing stages, it can only correctly serve one request at a time, and so some kind of arbitration has to be used. We can arbitrarily decide that reads are prioritized over write, but in case of a collision the operation will be for sure not correct, though safe for the electronics. The register_file module is just a multiplexer for the registers, it will output the value of the registers for the provided indexes. The module it’s used both in the operands fetching and writeback stages, but it’s safe because it allows two register reads and one write and the same time (three way access).
This toy processor is inspired from MIPS architecture and as you can see it uses fixed length instructions. But I’ve also tried to design something more similar to an x86 processor. The instruction set (which I’m too lazy to define extensively) could allow both register-register and register-memory operation, and output to either memory or registers. No indirect memory access is allowed, sorry. As the instructions are four byte long probably only a 16 or 24 bit flat address space could be accessed.
always @(posedge clock)
begin
//
//Instructrion fetch stage
FETCHED_INSTRUCTION <= get_instruction_memory(PC); PC <= PC+4;
//
//Instruction decoding stage
{OPCODE_0, OPERAND1_TYPE, REG1_INDEX, REG2_INDEX, MEM_ADDRESS_0, WRITE_TO_MEMORY_0, DEST_REG_0} <= decode_instruction(FETCHED_INSTRUCTION);
//
//Operands fecthing stage, some register will be forwarded to the next stages
OPERAND1 <= (OPERAND1_TYPE==REG) ?
read_register_file(REG1_INDEX, REG2_INDEX)[31:0] :
access_memory(MEM_ADDRESS_0,READ,UNSPECIFIED);
OPERAND2 <= register_file(REG1_INDEX,REG2_INDEX)[63:32];
MEM_ADDRESS_1 <= MEM_ADDRESS_0;
WRITE_TO_MEMORY_1 <= WRITE_TO_MEMORY_0;
DEST_REG_1 <= DEST_REG_0;
OPCODE_1 <= OPCODE_0;
//
//Execution stage
RESULT <= alu(OPCODE_1,OPERAND1,OPERAND2);
MEM_ADDRESS_2 <= MEM_ADDRESS_1;
WRITE_TO_MEMORY_2 <= WRITE_TO_MEMORY_1;
DEST_REG_2 <= DEST_REG_1;
//
//Writeback stage
(WRITE_TO_MEMORY_2) ? access_memory(MEM_ADDRESS_2,WRITE,RESULT) : write_register_file(DEST_REG_2, RESULT);
end
Several considerations can be made on this design. First of all extreme care is needed from the compiler (or the user, if it’s crazy enough to write assembly manually), because there are several hazards, not only as said previously on concurrent read/write memory access, but also on registers. As the result of an instruction is written back to the register file only at the end of the pipeline it’s not legal to access a result before five more instruction has been executed. To guarantee memory access validity it’s enough to avoid issuing a memory write two instruction after a memory load. If no useful instruction can be issued during this forced waiting, a NOP should be inserted.
Real world architecture employs interlocking techniques which takes care of stopping (stalling) the pipeline when an hazard could occur. I’ll describe stalling in more detail in another essay. Anyway if a pipeline stage, such us the ALU for complicated computation (for example integer division), needs more than a cycle to complete it will stall the provious stages of the pipeline, and insert so called bubbles into the later stage. Bubbles are basically safe No Operation input states, the NOP opcode in any architecture is just an explicit pipeline bubble. Even in modern interlocked architecture correct instruction scheduling is extremely useful to gain performance as pipeline bubbles actually waste execution time.
That’s all for now, in the previous post I forgot to mention the ever failing, but very nice Very Long Instruction Word (VLIW) approach. I’m of course going to talk about that to sometime.