Searching and Sorting
Consider searching for a given value v in an array of size N. There are 2 basic approaches: sequential search and binary search.
Sequential search involves looking at each value in turn (i.e., start with the value in array[0], then array[1], etc). The algorithm quits and returns true if the current value is v; it quits and returns false if it has looked at all of the values in the array without finding v.
If the values are in sorted order, then the algorithm can sometimes quit and return false without having to look at all of the values in the array: v is not in the array if the current value is greater than v.
The worst-case time for a sequential search is always O(N).
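The sequential-search idea above, including the early exit for sorted input, can be sketched in Python (the function name and list-based interface are illustrative, not from the notes):

```python
def sequential_search(a, v):
    """Return True if v occurs in the sorted list a, else False.

    Scans left to right; because a is sorted, the search can stop
    early as soon as a value greater than v is seen.
    """
    for x in a:
        if x == v:
            return True
        if x > v:       # v cannot appear later in a sorted list
            return False
    return False
```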
When the values are in sorted order, a better approach than the one given above is to use binary search. The algorithm for binary search starts by looking at the middle item x. If x is equal to v, it quits and returns true. Otherwise, it uses the relative ordering of x and v to eliminate half of the array (if v is less than x, then it can't be stored to the right of x in the array; similarly, if it is greater than x, it can't be stored to the left of x). Once half of the array has been eliminated, the algorithm starts again by looking at the middle item in the remaining half. It quits when it finds v or when the entire array has been eliminated.
The worst-case time for binary search is proportional to log2 N: the number of times N can be divided in half before there is nothing left. Using big-O notation, this is O(log N). Note that binary search in an array is basically the same as doing a lookup in a perfectly balanced binary-search tree (the root of a balanced BST is the middle value). In both cases, if the current value is not the one we're looking for, we can eliminate half of the remaining values.
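The halving logic described above can be sketched as an iterative binary search on a sorted Python list (names are illustrative):

```python
def binary_search(a, v):
    """Return True if v occurs in the sorted list a, else False."""
    low, high = 0, len(a) - 1
    while low <= high:
        mid = (low + high) // 2     # look at the middle item
        if a[mid] == v:
            return True
        elif a[mid] < v:
            low = mid + 1           # v can only be to the right of mid
        else:
            high = mid - 1          # v can only be to the left of mid
    return False                    # the entire array has been eliminated
```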
Consider sorting the values in an array A of size N. Most sorting algorithms are what are called comparison sorts; i.e., they work by comparing values. Comparison sorts can never have a worst-case running time less than O(N log N). Simple comparison sorts are usually O(N^2); the more clever ones are O(N log N).
Three interesting issues to consider when thinking about different sorting algorithms are:
- Does an algorithm always take its worst-case time?
- What happens on an already-sorted array?
- How much space (other than the space for the array itself) is required?
We will discuss four comparison-sort algorithms:
- selection sort
- insertion sort
- merge sort
- quick sort
Selection sort and insertion sort have worst-case time O(N^2). Quick sort is also O(N^2) in the worst case, but its expected time is O(N log N). Merge sort is O(N log N) in the worst case.
The idea behind selection sort is:
- Find the smallest value in A; put it in A[0].
- Find the second smallest value in A; put it in A[1]. etc.
The approach is as follows:
- Use an outer loop from 0 to N-1 (the loop index, k, tells which position in A to fill next).
- Each time around, use a nested loop (from k+1 to N-1) to find the smallest value (and its index) in the unsorted part of the array.
- Swap that value with A[k].
Note that after i iterations, A[0] through A[i-1] contain their final values (so after N iterations, A[0] through A[N-1] contain their final values and we're done!)
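The approach above, with outer loop index k and a nested loop from k+1 to N-1, might look like this in Python (an illustrative sketch, not code from the notes):

```python
def selection_sort(a):
    """Sort the list a in place using selection sort."""
    n = len(a)
    for k in range(n):
        # Find the index of the smallest value in the unsorted part a[k:].
        min_index = k
        for j in range(k + 1, n):
            if a[j] < a[min_index]:
                min_index = j
        # Swap that value into position k; a[0..k] now holds final values.
        a[k], a[min_index] = a[min_index], a[k]
```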
The idea behind insertion sort is:
- Put the first 2 items in correct relative order.
- Insert the 3rd item in the correct place relative to the first 2.
- Insert the 4th item in the correct place relative to the first 3. etc.
As for selection sort, a nested loop is used; however, a different invariant holds: after the ith time around the outer loop, the items in A[0] through A[i-1] are in order relative to each other (but are not necessarily in their final places). Also, note that in order to insert an item into its place in the (relatively) sorted part of the array, it is necessary to move some values to the right to make room.
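A sketch of insertion sort in Python, showing the right-shifting described above (names are illustrative):

```python
def insertion_sort(a):
    """Sort the list a in place using insertion sort."""
    for i in range(1, len(a)):
        v = a[i]          # the item to insert into a[0..i-1]
        j = i - 1
        # Shift larger values one place to the right to make room for v.
        while j >= 0 and a[j] > v:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = v      # a[0..i] is now in relative order
```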
As mentioned above, merge sort takes time O(N log N), which is quite a bit better than the two O(N^2) sorts described above (for example, when N = 1,000,000, N^2 = 1,000,000,000,000, and N log2 N = 20,000,000; i.e., N^2 is 50,000 times larger than N log N!).
The key insight behind merge sort is that it is possible to merge two sorted arrays, each containing N/2 items, to form one sorted array containing N items in time O(N). To do this merge, you just step through the two arrays, always choosing the smaller of the two values to put into the final array (and only advancing in the array from which you took the smaller value).
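The O(N) merge step can be sketched on its own in Python (a standalone helper for illustration; the notes themselves merge in place, as described later):

```python
def merge(left, right):
    """Merge two sorted lists into one sorted list in O(N) time."""
    result = []
    i = j = 0
    # Always take the smaller front value, advancing only in that list.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One list is exhausted; append whatever remains of the other.
    result.extend(left[i:])
    result.extend(right[j:])
    return result
```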
Now the question is, how do we get the two sorted arrays of size N/2? The answer is to use recursion; to sort an array of length N:
- Divide the array into two halves.
- Recursively, sort the left half.
- Recursively, sort the right half.
- Merge the two sorted halves.
The base case for the recursion is when the array to be sorted is of length 1 -- then it is already sorted, so there is nothing to do. Note that the merge step (step 4) needs to use an auxiliary array (to avoid overwriting its values). The sorted values are then copied back from the auxiliary array to the original array.
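The recursive scheme above, with an auxiliary list for the merge and a copy back into the original array, might be sketched like this in Python (index parameters and names are my own illustration):

```python
def merge_sort(a, low=0, high=None):
    """Sort a[low..high] in place; merging uses an auxiliary list."""
    if high is None:
        high = len(a) - 1
    if low >= high:              # base case: 0 or 1 items, already sorted
        return
    mid = (low + high) // 2
    merge_sort(a, low, mid)      # recursively sort the left half
    merge_sort(a, mid + 1, high) # recursively sort the right half
    # Merge the two sorted halves into an auxiliary list.
    aux = []
    i, j = low, mid + 1
    while i <= mid and j <= high:
        if a[i] <= a[j]:
            aux.append(a[i])
            i += 1
        else:
            aux.append(a[j])
            j += 1
    aux.extend(a[i:mid + 1])
    aux.extend(a[j:high + 1])
    # Copy the sorted values back from the auxiliary list.
    a[low:high + 1] = aux
```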
Quick sort (like merge sort) is a divide and conquer algorithm: it works by creating two problems of half size, solving them recursively, then combining the solutions to the small problems to get a solution to the original problem. However, quick sort does more work than merge sort in the "divide" part, and is thus able to avoid doing any work at all in the "combine" part!
The idea is to start by partitioning the array: putting all small values in the left half and putting all large values in the right half. Then the two halves are (recursively) sorted. Once that's done, there's no need for a "combine" step: the whole array will be sorted!
The key question is how to do the partitioning. Ideally, we'd like to put exactly half of the values in the left part of the array, and the other half in the right part; i.e., we'd like to put all values less than the median value in the left and all values greater than the median value in the right. However, that requires first computing the median value (which is too expensive). Instead, we pick one value to be the pivot, and we put all values less than the pivot to its left, and all values greater than the pivot to its right (the pivot itself is then in its final place).
Here's the algorithm outline:
- Choose a pivot value.
- Partition the array (put all values less than the pivot in the left part of the array, then the pivot itself, then all values greater than the pivot).
- Recursively, sort the values less than the pivot.
- Recursively, sort the values greater than the pivot.
Note that, as for merge sort, we need an auxiliary method with two extra parameters -- low and high indexes to indicate which part of the array to sort. Also, although we could "recurse" all the way down to a single item, in practice, it is better to switch to a sort like insertion sort when the number of items to be sorted is small (e.g., 20).
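A sketch of the outline above in Python, with the low/high parameters and the switch to insertion sort for small ranges. The cutoff constant and the pivot choice (here simply the last item, a placeholder, since pivot selection is discussed next) are assumptions for illustration:

```python
CUTOFF = 20  # below this size, switch to insertion sort (assumed threshold)

def quick_sort(a, low=0, high=None):
    """Sort a[low..high] in place using quick sort."""
    if high is None:
        high = len(a) - 1
    if high - low + 1 <= CUTOFF:
        # Small range: insertion sort on a[low..high].
        for i in range(low + 1, high + 1):
            v = a[i]
            j = i - 1
            while j >= low and a[j] > v:
                a[j + 1] = a[j]
                j -= 1
            a[j + 1] = v
        return
    # Partition around the pivot (placeholder choice: the last item).
    pivot = a[high]
    i = low
    for j in range(low, high):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[high] = a[high], a[i]  # the pivot is now in its final place
    quick_sort(a, low, i - 1)      # sort the values less than the pivot
    quick_sort(a, i + 1, high)     # sort the values greater than the pivot
```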
Now let's consider how to choose the pivot item. (Our goal is to choose it so that the "left part" and "right part" of the array have about the same number of items -- otherwise we'll get a bad runtime).