Friday, 18 July 2014

Implementing a fast dictionary for English words using a trie datastructure or radix tree in C#


The following article will present source code for implementing a fast dictionary for English words using a trie datastructure. The trie datastructure is basically a tree with fixed number of children, which in this case is kept as an array of Node instances. The trie is a tree of Node instances and will describe "paths", when resolving words. The application is itself a WPF application and will require .NET 4 or newer to execute. The following source code is the class Node, which is a specific node of the trie datastructure:

using System.Collections.Generic;

namespace AutoCompleteDictionary
{
    
    public class Node
    {

        private readonly Node[] children = new Node[26];

        public IEnumerable<KeyValuePair<Node, char>> AssignedChildren
        {
            get
            {
                for (int i = 0; i < 26; i++)
                {
                    if (children[i] != null)
                        yield return new KeyValuePair<Node, char>(children[i], (char)('a' + i));    
                }
            }
        }

        public Node GetOrCreate(char c)
        {
            Node child = this[c];
            if (child == null)
                child = this[c] = new Node();
            return child; 
        }

        public Node this[char c]
        {
            get { return children[c - 'a']; }
            set { children[c - 'a'] = value; }
        }

        public bool IsWordTerminator { get; set; }

    }

}

The Node class has a fixed sized Node array called children. The readonly property AssignedChildren returns which Node or letters (chars) are present at the current Node or "level", i.e. char position of the words that are mapped into the trie datastructure. In addition, the yield keyword is used here together with an iterator, which is implemented also in this class. The Node class will work with lowercase English letters, but could be adjusted for other alphabets as well. For a Norwegian alphabet, 29 letter would be used as the size of the children to accept the additional three Norwegian vowels, for example. Next, the code for the trie data structure is presented. It is itself a class.

using System.Collections.Generic;

namespace AutoCompleteDictionary
{
    
    public class Trie
    {

        private readonly Node root = new Node();

        public Node NodeForWord(string word, bool createPath)
        {
            Node current = root;

            foreach (char c in word)
            {
                if (createPath)
                    current = current.GetOrCreate(c);
                else
                    current = current[c];

                if (current == null)
                    return null;
            }

            return current;
        }

        public void AddNodeForWord(string word)
        {
            Node node = NodeForWord(word, true);
            node.IsWordTerminator = true; 
        }

        public bool ContainsWord(string word)
        {
            Node node = NodeForWord(word, false);
            return node != null && node.IsWordTerminator; 
        }

        public List PrefixedWords(string prefix)
        {
            var prefixedWords = new List<string>();
            Node node = NodeForWord(prefix, false);
            if (node == null)
                return prefixedWords;

            PrefixedWordsAux(prefix, node, prefixedWords);
            return prefixedWords; 
        }

        private void PrefixedWordsAux(string word, Node node, List<string> prefixedWords)
        {
            if (node.IsWordTerminator)
                prefixedWords.Add(word); 

            foreach (var child in node.AssignedChildren)
            {
                PrefixedWordsAux(word + child.Value, child.Key, prefixedWords); 
            }
        }

    }
}

The trie class or data structure will make use of Node instances as the nodes of its tree datastructure. It adds nodes with the AddNodeForWord method, which also sets the flag IsWordTerminator to true for the specific node. It will loop through the letters or chars of the passed in word or string and build up the Node subtree for the word. It is possible that "paths" are already registered. The method Containsword will not add new nodes but make use of the trie datastructure or Node tree and look if one for a given word ended up with a Node which is not null and that the Node has a flag IsWordTerminator which is true. The method PrefixedWords uses recursion to find "paths" or words that is, in the trie or Node tree that can be reached from the Node that the typed word matches, if any. The recursion will visit the entire subtree of the trie or Node subtree below the Node that matches the typed word, so that it is possible that the calculation will take some considerable type, if the user is not limited to typing at least some letters or chars for the prefix. In the application, three letters is set as a default minimum amount of letters to type. There are about 440,000 letters in this English dictionary. It is not complete, but there is a considerable amount. However, the time to get the matching words with the inputted prefix will for three letters or above typically take a very few milliseconds.
The next class is a simple view model that is used in the WPF client making use of the code above. The WPF xaml view sets the DataContext property to an instance of this view model to have a simple MVVM scenario, or in this case View-ViewModel scenario, as the Model is trivially the view model itself.

using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.Linq; 

namespace AutoCompleteDictionary
{
    
    public class AutoCompleteViewModel : INotifyPropertyChanged
    {

        private Trie trie = new Trie(); 

        private int prefixMinSize;
        public int PrefixMinSize
        {
            get { return prefixMinSize; }
            set
            {
                if (prefixMinSize != value)
                {
                    prefixMinSize = value;
                    RaisePropertyChanged("PrefixMinSize"); 
                }
            }
        }

        private string calculationInfo;
        public string CalculationInfo
        {
            get { return calculationInfo; }
            set
            {
                if (calculationInfo != value)
                {
                    calculationInfo = value;
                    RaisePropertyChanged("CalculationInfo");
                }
            }
        }

        private string inputWord;
        public string InputWord
        {
            get { return inputWord; }
            set
            {
                if (inputWord != value)
                {
                    value = value.ToLower(); 
                    inputWord = string.Join("", value.ToCharArray().Where(c => (int)c >= (int)'a' && (int)c <= (int)'z').ToArray()) ;
                    RaisePropertyChanged("InputWord");
                }
            }
        }

        public ObservableCollection<string> PrefixList { get; set; } 

        public void RaisePropertyChanged(string propertyName)
        {
            if (PropertyChanged != null)
                PropertyChanged(this, new PropertyChangedEventArgs(propertyName)); 
        }


        #region INotifyPropertyChanged Members

        public event PropertyChangedEventHandler PropertyChanged;

        #endregion

        public AutoCompleteViewModel()
        {
            PrefixMinSize = 3;
            InputWord = "Adv";
            PrefixList = new ObservableCollection<string>();
            ProcessDictionaryList();
        }

        private void ProcessDictionaryList()
        {
            foreach (var word in File.ReadLines("english-words"))
            {
                trie.AddNodeForWord(word);
            }
        }

        public void SetPrefixList()
        {
            Stopwatch stopWatch = Stopwatch.StartNew(); 
            PrefixList.Clear();
            var prefixes = GetPrefixList();
            
            foreach (var prefix in prefixes)
            {
                PrefixList.Add(prefix); 
            }
            //RaisePropertyChanged("PrefixList"); 

            CalculationInfo = string.Format("Retrieved {0} words prefixed with {1}. Operation took {2} ms", PrefixList.Count, inputWord, stopWatch.ElapsedMilliseconds); 
        }

        private List GetPrefixList()
        {
            if (InputWord.Length >= PrefixMinSize)
            {
                var wordsStartingWithInputWord = trie.PrefixedWords(InputWord.ToLower());
                return wordsStartingWithInputWord;
            }
            return new List<string>(); 
        }
    }
}


If the English dictionary looks like a nice little toy, it is possible to download the program. A compiled version is in the folder bin\Debug if you want to run the program without compiling the source code. Requires Visual Studio 2012 or newer. Download the English dictionary WPF client presented above using a trie or radix tree data structure here

1 comment:

  1. Are you trying to make money from your websites/blogs using popup advertisments?
    In case you are, did you know about PopCash?

    ReplyDelete