1-2 Instruction Set and Execution Time Model

To permit a rough comparison of algorithms, we imagine them being coded for a machine with an instruction set similar to that of today's general purpose RISC computers, such as the Compaq Alpha, the SGI MIPS, and the IBM RS/6000. The machine is three-address and has a fairly large number of general purpose registers梩hat is, 16 or more. Unless otherwise specified, the registers are 32 bits long. General register 0 contains a permanent 0, and the others can be used uniformly for any purpose.

In the interest of simplicity there are no "special purpose" registers, such as a condition register or a register to hold status bits, such as "overflow." No floating-point operations are described, because that is beyond the scope of this book.

We recognize two varieties of RISC: a "basic RISC," having the instructions shown in Table 1-2, and a "full RISC," having all the instructions of the basic RISC plus those shown in Table 1-3.

Table 1-2. Basic RISC Instruction Set

Opcode Mnemonic

Operands

Description

add, sub, mul, div, divu, rem, remu

RT,RA,RB

RT RA op RB, where op is add, subtract, multiply, divide signed, divide unsigned, remainder signed, or remainder unsigned.

addi, muli

RT,RA,I

RT RA op I, where op is add or multiply, and I is a 16-bit signed immediate value.

addis

RT,RA,I

RT RA + (I || 0x0000).

and, or, xor

RT,RA,RB

RT RA op RB, where op is bitwise and, or, or exclusive or.

andi, ori, xori

RT,RA,Iu

As above, except the last operand is a 16-bit unsigned immediate value.

beq, bne, blt, ble, bgt, bge

RT,target

Branch to target if RT = 0, or if RT 0, or if RT < 0, or if RT 0, or if RT > 0, or if RT 0 (signed integer interpretation of RT).

bt, bf

RT,target

Branch true/false; same as bne/beq resp.

cmpeq, cmpne, cmplt, cmple, cmpgt, cmpge, cmpltu, cmpleu, cmpgtu, cmpgeu

RT,RA,RB

RT gets the result of comparing RA with RB; 0 if false and 1 if true. Mnemonics denote compare for equality, inequality, less than, and so on, as for the branch instructions, and in addition, the suffix "u" denotes an unsigned comparison.

cmpieq, cmpine, cmpilt, cmpile, cmpigt, cmpige

RT,RA,I

Like cmpeq, and so on, except the second comparand is a 16-bit signed immediate value.

cmpiequ, cmpineu, cmpiltu, cmpileu, cmpigtu, cmpigeu

RT,RA,Iu

Like cmpltu, and so on, except the second comparand is a 16-bit unsigned immediate value.

ldbu, ldh, ldhu, ldw

RT,d(RA)

Load an unsigned byte, signed halfword, unsigned halfword, or word into RT from memory at location RA + d, where d is a 16-bit signed immediate value.

mulhs, mulhu

RT,RA,RB

RT gets the high-order 32 bits of the product of RA and RB; signed and unsigned.

not

RT,RA

RT bitwise one's-complement of RA.

shl, shr, shrs

RT,RA,RB

RT RA shifted left or right by the amount given in the rightmost six bits of RB; 0-fill except for shrs, which is sign-fill. (The shift amount is treated modulo 64.)

shli, shri, shrsi

RT,RA,Iu

RT RA shifted left or right by the amount given in the 5-bit immediate field.

stb, sth, stw

RS,d(RA)

Store a byte, halfword, or word, from RS into memory at location RA + d, where d is a 16-bit signed immediate value.

In these brief instruction descriptions, RA and RB appearing as source operands really means the contents of those registers.

A real machine would have branch and link (for subroutine calls), branch to the address contained in a register (for subroutine returns and "switches"), and possibly some instructions for dealing with special purpose registers. It would, of course, have a number of privileged instructions and instructions for calling on supervisor services. It might also have floating-point instructions.

Some other computational instructions that a RISC computer might have are identified in Table 1-3. These are discussed in later chapters.

Table 1-3. Additional Instructions for the "Full RISC"

Opcode Mnemonic

Operands

Description

abs, nabs

RT,RA

RT gets the absolute value, or the negative of the absolute value, of RA.

andc, eqv, nand, nor, orc

RT,RA,RB

Bitwise and with complement (of RB), equivalence, negative and, negative or, and or with complement.

extr

RT,RA,I,L

Extract bits I through I+L-1 of RA, and place them right-adjusted in RT, with 0-fill.

extrs

RT,RA,I,L

Like extr, but sign-fill.

ins

RT,RA,I,L

Insert bits 0 through L-1 of RA into bits I through I+L-1 of RT.

nlz

RT,RA

RT gets the number of leading 0's in RA (0 to 32).

pop

RT,RA

RT gets the number of 1-bits in RA (0 to 32).

ldb

RT,d(RA)

Load a signed byte into RT from memory at location RA + d, where d is a 16-bit signed immediate value.

moveq, movne, movlt, movle, movgt, movge

RT,RA,RB

RT RB if RA = 0, or if RA 0, and so on, else RT is unchanged.

shlr, shrr

RT,RA,RB

RT RA rotate-shifted left or right by the amount given in the rightmost five bits of RB.

shlri, shrri

RT,RA,Iu

RT RA rotate-shifted left or right by the amount given in the 5-bit immediate field.

trpeq, trpne, trplt, trple, trpgt, trpge, trpltu, trpleu, trpgtu, trpgeu

RA,RB

Trap (interrupt) if RA = RB, or RA RB, and so on.

trpieq, trpine, trpilt, trpile, trpigt, trpige

RA,I

Like trpeq, and so on, except the second comparand is a 16-bit signed immediate value.

trpigtu, trpigeu trpiequ, trpineu, trpiltu, trpileu,

RA,Iu

Like trpltu, and so on, except the second comparand is a 16-bit unsigned immediate value.

It is convenient to provide the machine's assembler with a few "extended mnemonics." These are like macros whose expansion is usually a single instruction. Some possibilities are shown in Table 1-4.

Table 1-4. Extended Mnemonics

Extended Mnemonic

Expansion

Description

b target

beq R0,target

Unconditional branch.

li RT,I

See text

Load immediate, -231 I < 232.

mov RT,RA

ori RT,RA,0

Move register RA to RT.

neg RT,RA

sub RT,R0,RA

Negate (two's-complement).

subi RT,RA,I

addi RT,RA,-I

Subtract immediate (I -215).

The load immediate instruction expands into one or two instructions, as required by the immediate value I. For example, if 0 I < 216, an or immediate (ori) from R0 can be used. If -215 I < 0, an add immediate (addi) from R0 can be used. If the rightmost 16 bits of I are 0, add immediate shifted (addis) can be used. Otherwise, two instructions are required, such as addis followed by ori. (Alternatively, in the last case a load from memory could be used, but for execution time and space estimates we assume that two elementary arithmetic instructions are used.)

Of course, which instructions belong in the basic RISC, and which belong in the full RISC is very much a matter of judgment. Quite possibly, divide unsigned and the remainder instructions should be moved to the full RISC category. Shift right signed is another suspicious instruction, given its low frequency of use in the SPEC benchmarks. The trouble is, in C it is easy to accidentally use these instructions, by doing a division with unsigned operands when they could just as well be signed, and by doing a shift right with a signed quantity (int) that could just as well be unsigned. Incidentally, shift right signed (or shift right arithmetic, as it is often called) does not do a division of a signed integer by a power of 2; you need to add 1 to the result if the dividend is negative and any nonzero bits are shifted out.

The distinction between basic and full RISC involves many other such questionable judgments, but we won't dwell on them.

The instructions are limited to two source registers and one target, which simplifies the computer (e.g., the register file requires no more than two read ports and one write port). It also simplifies an optimizing compiler, because the compiler does not need to deal with instructions that have multiple targets. The price paid for this is that a program that wants both the quotient and remainder of two numbers (not uncommon) must execute two instructions (divide and remainder). The usual machine division algorithm produces the remainder as a by-product, so many machines make them both available as a result of one execution of divide. Similar remarks apply to obtaining the doubleword product of two words.

The conditional move instructions (e.g., moveq) ostensibly have only two source operands, but in a sense they have three. Because the result of the instruction depends on the values in RT, RA, and RB, a machine that executes instructions out of order must treat RT in these instructions as both a use and a set. That is, an instruction that sets RT, followed by a conditional move that sets RT, must be executed in that order, and the result of the first instruction cannot be discarded. Thus, the designer of such a machine may elect to omit the conditional move instructions to avoid having to consider an instruction with (logically) three source operands. On the other hand, the conditional move instructions do save branches.

Instruction formats are not relevant to the purposes of this book, but the full RISC instruction set described above, with floating point and a few supervisory instructions added, can be implemented with 32-bit instructions on a machine with 32 general purpose registers (5-bit register fields). By reducing the immediate fields of compare, load, store, and trap instructions to 14 bits, the same holds for a machine with 64 general purpose registers (6-bit register fields).

Execution Time

We assume that all instructions execute in one cycle, except for the multiply, divide, and remainder instructions, for which we do not assume any particular execution time. Branches take one cycle whether they branch or fall through.

The load immediate instruction is counted as one or two cycles, depending on whether one or two elementary arithmetic instructions are required to generate the constant in a register.

Although load and store instructions are not often used in this book, we assume they take one cycle and ignore any load delay (time lapse between when a load instruction completes in the arithmetic unit, and when the requested data is available for a subsequent instruction).

However, knowing the number of cycles used by all the arithmetic and logical instructions is often insufficient for estimating the execution time of a program. Execution can be slowed substantially by load delays and by delays in fetching instructions. These delays, although very important and increasing in importance, are not discussed in this book. Another factor, one which improves execution time, is what is called "instruction-level parallelism," which is found in many contemporary RISC chips, particularly those for "high-end" machines.

These machines have multiple execution units and sufficient instruction-dispatching capability to execute instructions in parallel when they are independent (that is, when neither uses a result of the other, and they don't both set the same register or status bit). Because this capability is now quite common, the presence of independent operations is often pointed out in this book. Thus, we might say that such and such a formula can be coded in such a way that it requires eight instructions and executes in five cycles on a machine with unlimited instruction-level parallelism. This means that if the instructions are arranged in the proper order ("scheduled"), a machine with a sufficient number of adders, shifters, logical units, and registers can in principle execute the code in five cycles.

We do not make too much of this, because machines differ greatly in their instruction-level parallelism capabilities. For example, an IBM RS/6000 processor from ca. 1992 has a three-input adder, and can execute two consecutive add-type instructions in parallel even when one feeds the other (e.g., an add feeding a compare, or the base register of a load). As a contrary example, consider a simple computer, possibly for low-cost embedded applications, that has only one read port on its register file. Normally, this machine would take an extra cycle to do a second read of the register file for an instruction that has two register input operands. However, suppose it has a bypass so that if an instruction feeds an operand of the immediately following instruction, then that operand is available without reading the register file. On such a machine, it is actually advantageous if each instruction feeds the next梩hat is, if the code has no parallelism.