Friday 18 July 2014

Implementing a fast dictionary for English words using a trie datastructure or radix tree in C#


The following article will present source code for implementing a fast dictionary for English words using a trie datastructure. The trie datastructure is basically a tree with fixed number of children, which in this case is kept as an array of Node instances. The trie is a tree of Node instances and will describe "paths", when resolving words. The application is itself a WPF application and will require .NET 4 or newer to execute. The following source code is the class Node, which is a specific node of the trie datastructure:

using System.Collections.Generic;

namespace AutoCompleteDictionary
{
    
    public class Node
    {

        private readonly Node[] children = new Node[26];

        public IEnumerable<KeyValuePair<Node, char>> AssignedChildren
        {
            get
            {
                for (int i = 0; i < 26; i++)
                {
                    if (children[i] != null)
                        yield return new KeyValuePair<Node, char>(children[i], (char)('a' + i));    
                }
            }
        }

        public Node GetOrCreate(char c)
        {
            Node child = this[c];
            if (child == null)
                child = this[c] = new Node();
            return child; 
        }

        public Node this[char c]
        {
            get { return children[c - 'a']; }
            set { children[c - 'a'] = value; }
        }

        public bool IsWordTerminator { get; set; }

    }

}

The Node class has a fixed sized Node array called children. The readonly property AssignedChildren returns which Node or letters (chars) are present at the current Node or "level", i.e. char position of the words that are mapped into the trie datastructure. In addition, the yield keyword is used here together with an iterator, which is implemented also in this class. The Node class will work with lowercase English letters, but could be adjusted for other alphabets as well. For a Norwegian alphabet, 29 letter would be used as the size of the children to accept the additional three Norwegian vowels, for example. Next, the code for the trie data structure is presented. It is itself a class.

using System.Collections.Generic;

namespace AutoCompleteDictionary
{
    
    public class Trie
    {

        private readonly Node root = new Node();

        public Node NodeForWord(string word, bool createPath)
        {
            Node current = root;

            foreach (char c in word)
            {
                if (createPath)
                    current = current.GetOrCreate(c);
                else
                    current = current[c];

                if (current == null)
                    return null;
            }

            return current;
        }

        public void AddNodeForWord(string word)
        {
            Node node = NodeForWord(word, true);
            node.IsWordTerminator = true; 
        }

        public bool ContainsWord(string word)
        {
            Node node = NodeForWord(word, false);
            return node != null && node.IsWordTerminator; 
        }

        public List PrefixedWords(string prefix)
        {
            var prefixedWords = new List<string>();
            Node node = NodeForWord(prefix, false);
            if (node == null)
                return prefixedWords;

            PrefixedWordsAux(prefix, node, prefixedWords);
            return prefixedWords; 
        }

        private void PrefixedWordsAux(string word, Node node, List<string> prefixedWords)
        {
            if (node.IsWordTerminator)
                prefixedWords.Add(word); 

            foreach (var child in node.AssignedChildren)
            {
                PrefixedWordsAux(word + child.Value, child.Key, prefixedWords); 
            }
        }

    }
}

The trie class or data structure will make use of Node instances as the nodes of its tree datastructure. It adds nodes with the AddNodeForWord method, which also sets the flag IsWordTerminator to true for the specific node. It will loop through the letters or chars of the passed in word or string and build up the Node subtree for the word. It is possible that "paths" are already registered. The method Containsword will not add new nodes but make use of the trie datastructure or Node tree and look if one for a given word ended up with a Node which is not null and that the Node has a flag IsWordTerminator which is true. The method PrefixedWords uses recursion to find "paths" or words that is, in the trie or Node tree that can be reached from the Node that the typed word matches, if any. The recursion will visit the entire subtree of the trie or Node subtree below the Node that matches the typed word, so that it is possible that the calculation will take some considerable type, if the user is not limited to typing at least some letters or chars for the prefix. In the application, three letters is set as a default minimum amount of letters to type. There are about 440,000 letters in this English dictionary. It is not complete, but there is a considerable amount. However, the time to get the matching words with the inputted prefix will for three letters or above typically take a very few milliseconds.
The next class is a simple view model that is used in the WPF client making use of the code above. The WPF xaml view sets the DataContext property to an instance of this view model to have a simple MVVM scenario, or in this case View-ViewModel scenario, as the Model is trivially the view model itself.

using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.Linq; 

namespace AutoCompleteDictionary
{
    
    public class AutoCompleteViewModel : INotifyPropertyChanged
    {

        private Trie trie = new Trie(); 

        private int prefixMinSize;
        public int PrefixMinSize
        {
            get { return prefixMinSize; }
            set
            {
                if (prefixMinSize != value)
                {
                    prefixMinSize = value;
                    RaisePropertyChanged("PrefixMinSize"); 
                }
            }
        }

        private string calculationInfo;
        public string CalculationInfo
        {
            get { return calculationInfo; }
            set
            {
                if (calculationInfo != value)
                {
                    calculationInfo = value;
                    RaisePropertyChanged("CalculationInfo");
                }
            }
        }

        private string inputWord;
        public string InputWord
        {
            get { return inputWord; }
            set
            {
                if (inputWord != value)
                {
                    value = value.ToLower(); 
                    inputWord = string.Join("", value.ToCharArray().Where(c => (int)c >= (int)'a' && (int)c <= (int)'z').ToArray()) ;
                    RaisePropertyChanged("InputWord");
                }
            }
        }

        public ObservableCollection<string> PrefixList { get; set; } 

        public void RaisePropertyChanged(string propertyName)
        {
            if (PropertyChanged != null)
                PropertyChanged(this, new PropertyChangedEventArgs(propertyName)); 
        }


        #region INotifyPropertyChanged Members

        public event PropertyChangedEventHandler PropertyChanged;

        #endregion

        public AutoCompleteViewModel()
        {
            PrefixMinSize = 3;
            InputWord = "Adv";
            PrefixList = new ObservableCollection<string>();
            ProcessDictionaryList();
        }

        private void ProcessDictionaryList()
        {
            foreach (var word in File.ReadLines("english-words"))
            {
                trie.AddNodeForWord(word);
            }
        }

        public void SetPrefixList()
        {
            Stopwatch stopWatch = Stopwatch.StartNew(); 
            PrefixList.Clear();
            var prefixes = GetPrefixList();
            
            foreach (var prefix in prefixes)
            {
                PrefixList.Add(prefix); 
            }
            //RaisePropertyChanged("PrefixList"); 

            CalculationInfo = string.Format("Retrieved {0} words prefixed with {1}. Operation took {2} ms", PrefixList.Count, inputWord, stopWatch.ElapsedMilliseconds); 
        }

        private List GetPrefixList()
        {
            if (InputWord.Length >= PrefixMinSize)
            {
                var wordsStartingWithInputWord = trie.PrefixedWords(InputWord.ToLower());
                return wordsStartingWithInputWord;
            }
            return new List<string>(); 
        }
    }
}


If the English dictionary looks like a nice little toy, it is possible to download the program. A compiled version is in the folder bin\Debug if you want to run the program without compiling the source code. Requires Visual Studio 2012 or newer. Download the English dictionary WPF client presented above using a trie or radix tree data structure here

Implementing a fast point structure in C# for large-scale comparison checks and searches

The code below is extracted from Part I of the Pluralsight course "Making .NET Applications faster", discussed and presented below. Implementing a fast structure for 2D points in C# requires using a struct instead of a class, since this is value-based and not reference type, i.e making use of the stack and not the heap and avoiding expensive header fields of objects. In addition, it is necessary to:
  • Override the Equals method inherited from System.Object
  • Implement a method called Equals that returns true and has one input parameter, another instance of the same struct
  • Mark the struct with the generic IEquatable interface, IEquatable<PointV5>
  • Implement the operators == and != to make use of the Equals method receiving an instance of the struct
  • Implement GetHashCode, using Jon Skeet's advice of creating a weighted sum multiplied by prime numbers and the struct's fields
Implementing this equality regime will reduce overall size in memory about 4x and increase speed 4x to 10x for large scale scenarios. In the example code testing this code, 10 million 2D point structs of type PointV5 was tested. Most modern games will avoid reference types of course and stick to value types, i.e. structs. Since most of us are application developers, make sure if you create a LOT of objects, consider switching from class to struct type(s), if possible. Often, it is not needed to use classes, a struct will be sufficient (and faster and lighter). Check out the course page here: Pluralsight: Making .NET applications faster Pluralsight is a great course and IT resource for IT professionals!

using System;


    struct PointV5 : IEquatable
    {
        public int X;
        public int Y;

        public override bool Equals(object obj)
        {
            if (!(obj is PointV5)) return false;
            PointV5 other = (PointV5)obj;
            return X == other.X && Y == other.Y;
        }

        public bool Equals(PointV5 other)
        {
            return X == other.X && Y == other.Y;
        }

        public static bool operator ==(PointV5 a, PointV5 b)
        {
            return a.Equals(b);
        }

        public static bool operator !=(PointV5 a, PointV5 b)
        {
            return !a.Equals(b);
        }

        public override int GetHashCode()
        {
            // 19 and 29 are primes, and this doesn't assume anything about
            // the distribution of X and Y.
            // Also see http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode
            int hash = 19;
            hash = hash * 29 + X;
            hash = hash * 29 + Y;
            return hash;
        }
    }


Actually, a simple class is in fact faster in this scenario, according to the testing I did. Consider this class:

public class PointV0
{

        public int X { get; set; }

        public int Y { get; set; }

}

However, the price on pays here is higher memory overhead, as each point will have to be stored on the heap and have the object header fields and method table pointer fields, i.e. taking up more memory.

Elementary class
        Average time per lookup: 85,70ms
        Garbage collections: 0
Naked struct
        Average time per lookup: 435,10ms
        Garbage collections: 1018
With Equals override
        Average time per lookup: 248,70ms
        Garbage collections: 510
With Equals overload
        Average time per lookup: 239,50ms
        Garbage collections: 510
With IEquatable
        Average time per lookup: 168,60ms
        Garbage collections: 0
All bells and whistles
        Average time per lookup: 170,00ms
        Garbage collections: 0
Press any key to continue ..

I cannot conclude from these results that structs always are faster than classes, but it will always be more memory overhead to resort to classes instead of structs.. It looks though, that in this example, a simple class was the fastest choice!