10.5.1 Straight Merge

Perhaps the most straightforward sorting technique is the straight merge. The straight merge distributes the initial records onto two tapes, a and b, so that they contain the same number of records, or so that one tape contains only one more record than the other. If this is done for the original file in Example 10.5, then the configuration of tapes a and b will appear as the initial distribution.

a   2|17|14|17|50|20|48|16|15|30|16

b  12|16|30| 2|65|32|58|20|10|45|

Now a and b can be thought of as being composed of subfiles, each of which is sorted and of length 1. A subfile is a run of length r if it consists of r records sorted in order. Here, a contains eleven subfiles, or runs, of length 1, and b contains ten runs of length 1.

Apply a merge procedure to each pair of runs, each pair containing a run from tape a and the corresponding run from tape b. Thus

 2 17 14

12 16 30, etc.

are "paired." The merge procedure consecutively merges each pair (consisting of a sorted subfile of length 1 from a, and a corresponding sorted subfile of length 1 from b). The resultant merged files are written consecutively to tape c. The result of the first merge of a and b to c is

c 2 12|16 17|14 30|2 17|50 65|20 32|48 58|16 20|10 15|30 45|16

Think of c as containing ten subfiles or runs of length 2 and one run of length 1 and distribute these to a and b equally to obtain the second distribution.

a  2 12|14 30|50 65|48 58|10 15|16

b 16 17| 2 17|20 32|16 20|30 45|

By now there are five paired runs of length 2 and one run of length 1. These may again be consecutively merged, and written to c to obtain the second merge of a and b to c.

c 2 12 16 17|2 14 17 30|20 32 50 65|16 20 48 58|10 15 30 45|16

Another distribution to a and b yields the third distribution.

a 2 12 16 17|20 32 50 65|10 15 30 45

b 2 14 17 30|16 20 48 58|16

These pairs of length 4 can now be merged to obtain the third merge of a and b to c.

c 2 2 12 14 16 17 17 30|16 20 20 32 48 50 58 65|10 15 16 30 45

It takes two more distributions and merges to complete the sort. The fourth distribution

a  2  2 12 14 16 17 17 30|10 15 16 30 45

b 16 20 20 32 48 50 58 65|

is followed by the fourth merge.

c 2 2 12 14 16 16 17 17 20 20 30 32 48 50 58 65|10 15 16 30 45

The fifth distribution is next. It is followed by the fifth merge.

a  2  2 12 14 16 16 17 17 20 20 30 32 48 50 58 65

b 10 15 16 30 45

c  2  2 10 12 14 15 16 16 16 17 17 20 20 30 32 45 48 50 58 65

This completes the sort, which has taken five distributions and five merges. The sort is described as a five-pass sort, with five distribution phases and five merge phases, using three tapes. Programming this process is not a trivial task. Ignoring details, the basic time requirements can be seen to be determined by the number of times files must be rewound or rewritten and the number of records that must be accessed from secondary storage (tapes). This assumes that the internal memory processing will be done so quickly that it does not add to the total time for the sort. Before the original file, c, can be distributed (rewritten) to tapes a and b. those tapes must be rewound and the original file must be rewound. Before a and b can be merged (and written) to c, they must be rewound, and c must be rewound. Also, in each phase, whether distribution or merge, each record must be accessed from secondary storage. Thus the total time is proportional to the number of passes (five here). If the access time of a record is Ta, and the rewind time Tr, then the total time is 5 ?/FONT> (2 ?/FONT> 21 x Ta + 3 ?/FONT> Tr). Note the factor of 2 associated with the twenty-one records and the access time. The factor is 2 because there are two phases, distribution and merge.