7-2 Shuffling Bits

Another important permutation of the bits of a word is the "perfect shuffle" operation, which has applications in cryptography. There are two varieties, called the "outer" and "inner" perfect shuffles. They both interleave the bits in the two halves of a word in a manner similar to a perfect shuffle of a deck of 32 cards, but they differ in which card is allowed to fall first. In the outer perfect shuffle, the outer (end) bits remain in the outer positions, and in the inner perfect shuffle, bit 15 moves to the left end of the word (position 31). If the 32-bit word is (where each letter denotes a single bit)

abcd efgh ijkl mnop ABCD EFGH IJKL MNOP, 

then after the outer perfect shuffle it is

aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP, 

and after the inner perfect shuffle it is

AaBb CcDd EeFf GgHh IiJj KkLl MmNn OoPp. 

Assume the word size W is a power of 2. Then the outer perfect shuffle operation can be accomplished with basic RISC instructions in log2(W/2) steps, where each step swaps the second and third quartiles of successively smaller pieces [GLS1]. That is, a 32-bit word is transformed as follows:

abcd efgh ijkl mnop ABCD EFGH IJKL MNOP 
abcd efgh ABCD EFGH ijkl mnop IJKL MNOP 
abcd ABCD efgh EFGH ijkl IJKL mnop MNOP 
abAB cdCD efEF ghGH ijIJ klKL mnMN opOP 
aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP 

Straightforward code for this is

x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF; 
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F; 
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3; 
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999; 

which requires 42 basic RISC instructions. This can be reduced to 30 instructions, although at an increase from 17 to 21 cycles on a machine with unlimited instruction-level parallelism, by using the exclusive or method of exchanging two fields of a register (described on page 40). All quantities are unsigned:

t = (x ^ (x >> 8)) & 0x0000FF00;?x = x ^ t ^ (t << 8); 
t = (x ^ (x >> 4)) & 0x00F000F0;?x = x ^ t ^ (t << 4); 
t = (x ^ (x >> 2)) & 0x0C0C0C0C;?x = x ^ t ^ (t << 2); 
t = (x ^ (x >> 1)) & 0x22222222;?x = x ^ t ^ (t << 1); 

The inverse operation, the outer unshuffle, is easily accomplished by performing the swaps in reverse order:

t = (x ^ (x >> 1)) & 0x22222222;?x = x ^ t ^ (t << 1); 
t = (x ^ (x >> 2)) & 0x0C0C0C0C;?x = x ^ t ^ (t << 2); 
t = (x ^ (x >> 4)) & 0x00F000F0;?x = x ^ t ^ (t << 4); 
t = (x ^ (x >> 8)) & 0x0000FF00;?x = x ^ t ^ (t << 8); 

Using only the last two steps of either of the above two shuffle sequences shuffles the bits of each byte separately. Using only the last three steps shuffles the bits of each halfword separately, and so on. Similar remarks apply to unshuffling, except by using the first two or three steps.

To get the inner perfect shuffle, prepend to these sequences a step to swap the left and right halves of the register:

x = (x >> 16) | (x << 16); 

(or use a rotate of 16 bit positions). The unshuffle sequence can be similarly modified by appending this line of code.

Altering the transformation to swap the first and fourth quartiles of successively smaller pieces produces the bit reversal of the inner perfect shuffle.

Perhaps worth mentioning is the special case in which the left half of the word x is all 0. In other words, we want to move the bits in the right half of x to every other bit position梩hat is, to transform the 32-bit word

0000 0000 0000 0000 ABCD EFGH IJKL MNOP 

to

0A0B 0C0D 0E0F 0G0H 0I0J 0K0L 0M0N 0O0P. 

The outer perfect shuffle code can be simplified to do this task in 22 basic RISC instructions. The code below, however, does it in only 19 basic RISC instructions, at no cost in execution time on a machine with unlimited instruction-level parallelism (12 cycles with either method). This code does not require that the left half of word x be initially cleared.

x = ((x & 0xFF00) << 8) | (x & 0x00FF); 
x = ((x << 4) | x) & 0x0F0F0F0F; 
x = ((x << 2) | x) & 0x33333333; 
x = ((x << 1) | x) & 0x55555555; 

Similarly, for the inverse of this "half shuffle" operation (a special case of compress; see page 116), the outer perfect unshuffle code can be simplified to do the task in 26 or 29 basic RISC instructions, depending on whether or not an initial and operation is required to clear the bits in the odd positions. The code below, however, does it in only 18 or 21 basic RISC instructions, and with less execution time on a machine with unlimited instruction-level parallelism (12 or 15 cycles).

x = x & 0x55555555;牋牋牋牋?// (If required.) 
x = ((x >> 1) | x) & 0x33333333; 
x = ((x >> 2) | x) & 0x0F0F0F0F; 
x = ((x >> 4) | x) & 0x00FF00FF; 
x = ((x >> 8) | x) & 0x0000FFFF;