If data to be sorted doesn’t fit into main memory external sorting is applicable. External merge sort can be separated into two phases:
To implement this algorithm I will use solutions from my previous posts so it may be helpful for you to look at them:
Let’s assume that M records at the same time are allowed to be loaded into main memory. One of the ways to create initial runs is to successively read M records from original file, sort them in memory and write back to disk. However we will use approach that allows us to create longer runs. It is called replacement selection.
The core structure behind this algorithm is priority queue. Taking one by one current minimum element out of the queue forms ordered sequence. And this is exactly what run stands for. The algorithm can be described as follows:
At any given moment at most M records are loaded into main memory as single written element into current run is replaced with single element from unsorted file if any (depending on comparison it either goes into current or next run).
Next step is to merge created initial runs. For the merge step we will use simplified algorithm (more advanced algorithms work with multiple physical devices to distribute runs, take into account data locality, etc.) based on k-way merge:
Yeap, it is that simple. And let’s code it.
The implementation abstracts file structure and reading/writing details making algorithm more concise and easier to understand.
abstract class ExternalSorter<T> { private readonly IComparer<T> m_comparer; private readonly int m_capacity; private readonly int m_mergeCount; protected ExternalSorter(IComparer<T> comparer, int capacity, int mergeCount) { m_comparer = comparer; m_capacity = capacity; m_mergeCount = mergeCount; } // Sorts unsorted file and returns sorted file name public string Sort(string unsorted) { var runs = Distribute(unsorted); return Merge(runs); } // Write run to disk and return created file name protected abstract string Write(IEnumerable<T> run); // Read run from file with given name protected abstract IEnumerable<T> Read(string name); // Merge step in this implementation is simpler than // the one used in polyphase merge sort - it doesn't // take into account distribution over devices private string Merge(IEnumerable<string> runs) { var queue = new Queue<string>(runs); var runsToMerge = new List<string>(m_mergeCount); // Until single run is left do merge while (queue.Count > 1) { // Priority queue must not contain records more than // required var count = m_mergeCount; while (queue.Count > 0 && count-- > 0) runsToMerge.Add(queue.Dequeue()); // Perform n-way merge on selected runs where n is // equal to number of physical devices with // distributed runs but in our case we do not take // into account them and thus n is equal to capacity var merged = runsToMerge.Select(Read).OrderedMerge(m_comparer); queue.Enqueue(Write(merged)); runsToMerge.Clear(); } // Last run represents source file sorted return queue.Dequeue(); } // Distributes unsorted file into several sorted chunks // called runs (run is a sequence of records that are // in correct relative order) private IEnumerable<string> Distribute(string unsorted) { var source = Read(unsorted); using (var enumerator = source.GetEnumerator()) { var curr = new PriorityQueue<T>(m_comparer); var next = new PriorityQueue<T>(m_comparer); // Prefill priority queue to capacity which is used // to create runs while (curr.Count < m_capacity && enumerator.MoveNext()) curr.Enqueue(enumerator.Current); // Until unsorted source and priority queues are // exhausted while (curr.Count > 0) { // Create next run and write it to disk var sorted = CreateRun(enumerator, curr, next); var run = Write(sorted); yield return run; Swap(ref curr, ref next); } } } private IEnumerable<T> CreateRun(IEnumerator<T> enumerator, PriorityQueue<T> curr, PriorityQueue<T> next) { while (curr.Count > 0) { var min = curr.Dequeue(); yield return min; // Trying to move run to an end enumerator will // result in returning false and thus current // queue will simply be emptied step by step if (!enumerator.MoveNext()) continue; // Check if current run can be extended with // next element from unsorted source if (m_comparer.Compare(enumerator.Current, min) < 0) { // As current element is less than min in // current run it may as well be less than // elements that are already in the current // run and thus from this element goes into // next run next.Enqueue(enumerator.Current); } else { // Extend current run curr.Enqueue(enumerator.Current); } } } private static void Swap<U>(ref U a, ref U b) { var tmp = a; a = b; b = tmp; } }
In the example below I created type that sorts text files containing single number per line.
class TextFileOfNumbersExternalSorter : ExternalSorter<int> { public TextFileOfNumbersExternalSorter(int capacity, int mergeCount) : base(Comparer<int>.Default, capacity, mergeCount) { } protected override string Write(IEnumerable<int> run) { var file = Path.GetTempFileName(); using (var writer = new StreamWriter(file)) { run.Run(writer.WriteLine); } return file; } protected override IEnumerable<int> Read(string name) { using (var reader = new StreamReader(name)) { while (!reader.EndOfStream) yield return Int32.Parse(reader.ReadLine()); } File.Delete(name); } }
That is used like this:
// capacity, mergeCount and unsortedFileName are initialized elsewhere var sorter = new TextFileOfNumbersExternalSorter(capacity, mergeCount); var sortedFileName = sorter.Sort(unsortedFileName);
That’s it folks!
Like most of life's problems, this one can be solved with bending - Bender B.Rodrigues
Like most of life's problems, this one can be solved with bending
- Bender B.Rodrigues
Let’s bend another problem. Given set of characters P and string T find minimum window in T that contains all characters in P. Applicable solution is restricted to O(length(T)) time complexity. For example, given a string T “of characters and as” and set of characters T in a form of a string “aa s” the minimum window will be “and as”.
The problem can be broken into two parts:
Selecting every possible window (all unique pairs (i, j) where 0 <= i <= j < length(T)) will lead to solution worse than O(length(T)^2) because you still need to check if all characters from P are within selected window. Instead we will check every possible window ending position. Thus there are at least length(T) windows to consider.
Any feasible window has length equal to or greater than length(P). Performing recheck for any considered window will result in a solution no better than O(length(T)*length(P)). Instead we need to use check results from previous iteration.
Now we need to make sure that checking if a particular character is in P is done in an optimal way. Taking into account that a particular character may appear more than once and window thus must contain appropriate number of characters. We will use hash table to map unique characters from P to their count for fast lookup.
And now let’s tie all things together.
Code the thing! =)
static string FindMinWindow(string t, string p) { // Create char to count mapping for fast lookup // as some characters may appear more than once var charToCount = new Dictionary<char, int>(); foreach (var c in p) { if (!charToCount.ContainsKey(c)) charToCount.Add(c, 0); charToCount[c]++; } var unmatchesCount = p.Length; int minWindowLength = t.Length + 1, minWindowStart = -1; int currWindowStart = 0, currWindowEnd = 0; for (; currWindowEnd < t.Length; currWindowEnd++) { var c = t[currWindowEnd]; // Skip chars that are not in P if (!charToCount.ContainsKey(c)) continue; // Reduce unmatched characters count charToCount[c]--; if (charToCount[c] >= 0) // But do this only while count is positive // as count may go negative which means // that there are more than required characters unmatchesCount--; // No complete match, so continue searching if (unmatchesCount > 0) continue; // Decrease window as much as possible by removing // either chars that are not in T or those that // are in T but there are too many of them c = t[currWindowStart]; var contains = charToCount.ContainsKey(c); while (!contains || charToCount[c] < 0) { if (contains) // Return character to P charToCount[c]++; c = t[++currWindowStart]; contains = charToCount.ContainsKey(c); } if (minWindowLength > currWindowEnd - currWindowStart + 1) { minWindowLength = currWindowEnd - currWindowStart + 1; minWindowStart = currWindowStart; } // Remove last char from window - it is definitely in a // window because we stopped at it during decrease phase charToCount[c]++; unmatchesCount++; currWindowStart++; } return minWindowStart > -1 ? t.Substring(minWindowStart, minWindowLength) : String.Empty; }
Every character is examined at most twice (during appending to the end and during compaction) so the whole solution has O(length(T)) time complexity assuming hash table lookup is O(1) operation.
Recently I came across simple yet interesting coding problem. So here is the deal. You are given positive integer N. Print first N ^ 2 positive integers in matrix form in a such a way that within matrix numbers form spiral starting from its center and goring clockwise. For example, for N = 5 matrix to be printed is:
Optimize it for speed and space.
One way you can approach it is to create N x N matrix and fill it with numbers that form spiral and then print whole matrix row by row. But this solution will be of N ^ 2 space complexity. Let’s try to reach O(1) space complexity.
The key observation here is how matrix changes when N changes by 1.
N = 1.
N = 2.
N = 3.
N = 4.
Can you see the pattern here? At every step we extend previous matrix (P) with additional column and row (C). If N is even we extend previous matrix of size N – 1 with right column and bottom row
and with left column and top row if it is odd
This leads us to naturally recursive algorithm. We have three cases:
So basically to print a row we need to know matrix size N and row index. Here goes the solution.
static void Print(int n) { for(int i = 0; i < n; i++) { PrintLine(n, i); Console.WriteLine(); } } static void PrintLine(int n, int i) { // Number of integers in current matrix var n2 = n*n; // Number of itegers in previous matrix of size n - 1 var m2 = (n - 1)*(n - 1); if (n % 2 == 0) { if (i == n - 1) { // n is even and we are at the last row so just // print it for(int k = n2; k > n2 - n; k--) { PrintNum(k); } } else { // Print row from previous matrix of size n - 1 // first and then print value that belongs to current // matrix. Previous matrix is at the top left corner // so no need to adjust row index PrintLine(n - 1, i); // Skip all integers from previous matrix and upper // ones in this columnas integers must form clockwise // spiral PrintNum(m2 + 1 + i); } } else { if (i == 0) { // n is odd and we are at the first row so just // print it for(int k = m2 + n; k <= n2; k++) { PrintNum(k); } } else { // Print value that belongs to current matrix and // then print row from previous matrix of size n - 1 // Skip all integers from previous matric and bottom // ones in this column as integers must form clockwise // spiral PrintNum(m2 + n - i); // Previous matrix is at the bottom right corner so // row index must be reduced by 1 PrintLine(n - 1, i - 1); } } } static void PrintNum(int n) { Console.Write("{0, -4} ", n); }
If stack is not considered then this solution has O(1) space complexity otherwise O(N).