Coding Grounds: Splitting a ReadOnlySpan by a separator

Monday, 22 August 2022

Splitting a ReadOnlySpan by a separator

This article will look into using the ReadonlySpan and doing an equivalent string split operation. The method uses ReadOnlySpan of type T (char) to split into 'words' or 'tokens', separated by a split char (separator). A span was introduced in C# 7.3 and can be used in .net standard 2.0 or .net framework using the Nuget package System.Memory.



  <ItemGroup>
    <PackageReference Include="System.Memory" Version="4.5.5" />
  </ItemGroup>

If you use newer target frameworks, you will get Span included as long as C# 7.3 is supported. The code here is just demonstration code, it successfully splits a ReadOnlySpan of T (char) using the contiguous memory on the stack, however, we must still convert to string here the 'tokens' or 'words' after the split operation. Note the usage of Slice method here to retrieve a range from the ReadOnlySpan of char here, we use a List of string to get the words or 'tokens'. It would be nice to somehow avoid string as much as possible, but we want to have an array of strings back anyways, so a List of string is used here to get the 'tokens'. What would be optimal would be to just return the split indexes as the code already extracts here and return those split indexes, which later could be used to build a string array. We have all the characters in the ReadOnlySpan of char here, so only having the split indexes would be sufficient. However, this would
from the consumer side be a bit cumbersome. You could though have a method like 'get nth word' using the split indexes here and so on.

 

using System;
using System.Collections.Generic;

namespace SpanStringSplit
{
    public static class SpanExtensions
    {

        public static string[] SplitViaSpan(this string input, char splitChar, StringSplitOptions splitOptions)
        {
            if (string.IsNullOrWhiteSpace(input) || input.IndexOf(splitChar) < 0)
            {
                return new string[] { input };
            }
            var tokens = SplitSpan(input.AsSpan(), splitChar, splitOptions);
            return tokens; 
        }

        public static string[] SplitSpan(this ReadOnlySpan<char> inputSpan, char splitChar, StringSplitOptions splitOptions)
        {
            if (inputSpan == null)
            {
                return new string[] { null };
            }
            if (inputSpan.Length == 0)
            {
                return splitOptions == StringSplitOptions.None ? new string[] { string.Empty } : new string[0]; 
            }
            bool isSplitCharFound = false; 
            foreach (char letter in inputSpan)
            {
                if (letter == splitChar)
                {
                    isSplitCharFound = true;
                    break;
                }
            }
            if (!isSplitCharFound)
            {
                return new string[] { inputSpan.ToString() }; 
            }

            bool IsTokenToBeAdded(string token) => !string.IsNullOrWhiteSpace(token) || splitOptions == StringSplitOptions.None;

            var splitIndexes = new List<int>();
            var tokens = new List<string>();
            int charIndx = 0;
            foreach (var ch in inputSpan)
            {
                if (ch == splitChar)
                {
                    splitIndexes.Add(charIndx);
                }
                charIndx++;
            }
            int currentSplitIndex = 0;
            foreach (var indx in splitIndexes)
            {
                if (currentSplitIndex == 0)
                {
                    string firstToken = inputSpan.Slice(0, splitIndexes[0]).ToString();
                    if (IsTokenToBeAdded(firstToken))
                    {
                        tokens.Add(firstToken);
                    }
                }
                else if (currentSplitIndex <= splitIndexes.Count)
                {
                    string intermediateToken = inputSpan.Slice(splitIndexes[currentSplitIndex - 1] + 1, splitIndexes[currentSplitIndex] - splitIndexes[currentSplitIndex - 1] - 1).ToString();
                    if (IsTokenToBeAdded(intermediateToken))
                    {
                        tokens.Add(intermediateToken);
                    }
                }
                currentSplitIndex++;
            }
            string lastToken = inputSpan.Slice(splitIndexes[currentSplitIndex - 1] + 1).ToString();
            if (IsTokenToBeAdded(lastToken))
            {
                tokens.Add(lastToken);
            }
            return tokens.ToArray();
        }

    }
}

And we have our suceeding unit tests :

 

using NUnit.Framework;
using System;

namespace SpanStringSplit.Test
{
    [TestFixture]
    public class SpanExtensionsSpec
    {
        [Test]
        public void SplitStringsViaSpan()
        {
            var tokens = ",,The,quick,brown,fox,jumped,over,the,lazy,,dog".SplitViaSpan(',', StringSplitOptions.RemoveEmptyEntries);
            CollectionAssert.AreEqual(new string[] { "The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog" }, tokens);
        }

        [Test]
        public void SplitStringsUsingSpan()
        {
            ReadOnlySpan<char> s = ",,The,quick,brown,fox,jumped,over,the,lazy,,dog".ToCharArray();                
            var tokens = s.SplitSpan(',', StringSplitOptions.RemoveEmptyEntries);
            CollectionAssert.AreEqual(new string[] { "The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog" }, tokens);
        }

    }
}

To sum up - we can use Span here to get an contiguous space of memory (stack in this case). To get a span from a string we use the extension method 'AsSpan()' To get a string from a range of a span (Slice), we just use the Slice method and then call ToString().

The following code then extracts the nth token (or word) only using the stack and first extracts the split indices (considering up to a given index + 1 of split indices if available) and then using the Splice method to get the chars of the token or 'word'.

 


        public static string GetNthToken(this ReadOnlySpan<char> inputSpan, char splitChar, int nthToken)
        {
            if (inputSpan == null)
            {
                return null;
            }
            int[] splitIndexes = inputSpan.SplitIndexes(splitChar, nthToken); 
            if (splitIndexes.Length == 0)
            {
                return inputSpan.ToString();
            }
            if (nthToken == 0 && splitIndexes.Length > 0)
            {
                return inputSpan.Slice(0, splitIndexes[0]).ToString(); 
            }
            if (nthToken > splitIndexes.Length)
            {
                return null; 
            }
            if (nthToken == splitIndexes.Length)
            {
                var split = inputSpan.Slice(splitIndexes[nthToken-1]+1).ToString();
                return split; 
            }
            if (nthToken <= splitIndexes.Length + 1)
            {
                var split = inputSpan.Slice(splitIndexes[nthToken-1]+1, splitIndexes[nthToken] - splitIndexes[nthToken-1]-1).ToString();
                return split; 
            }
            return null; 

        }

        public static int[] SplitIndexes(this ReadOnlySpan<char> inputSpan, char splitChar,
            int? highestSplitIndex = null)
        {
            if (inputSpan == null)
            {
                return Array.Empty<int>();
            }
            if (inputSpan.Length == 0)
            {
                return Array.Empty<int>();
            }
            bool isSplitCharFound = false;
            foreach (char letter in inputSpan)
            {
                if (letter == splitChar)
                {
                    isSplitCharFound = true;
                    break;
                }
            }
            if (!isSplitCharFound)
            {
                return Array.Empty<int>();
            }
         
            var splitIndexes = new List<int>();
            var tokens = new List<string>();
            int charIndex = 0;
            foreach (var ch in inputSpan)
            {
                if (ch == splitChar)
                {
                    if (highestSplitIndex.HasValue && highestSplitIndex + 1 < splitIndexes.Count)
                    {
                        break; 
                    }
                    splitIndexes.Add(charIndex);
                }
                charIndex++; 
            }
            return splitIndexes.ToArray(); 
        }

Now why would you use this instead of just sticking to ordinary string class methods. The main goal was to look into Span and how we can use it to look at contiguous memory and work at sub parts of this memory using the Slice method. In some applications, such as games and graphics in general, such micro optimizations are more important to avoid allocating a lot of string variables. Finding the split incides first (up to a given index if available) and then retrieving the nth token or word can be very useful instead of spitting into an array of strings. The unit tests are also passing for GetNthToken method :

 
        [Test]
        [TestCase(",, The, quick, brown, fox, jumped, over, the, lazy,, dog", 5, "fox")]
        [TestCase(",, The, quick, brown, fox, jumped, over, the, lazy,, dog", 0, "")]
        [TestCase(",, The, quick, brown, fox, jumped, over, the, lazy,, dog", 1, "")]
        [TestCase(",, The, quick, brown, fox, jumped, over, the, lazy,, dog", 2, "The")]
        [TestCase(",, The, quick, brown, fox, jumped, over, the, lazy,, dog", 3, "quick")]
        [TestCase(",, The, quick, brown, fox, jumped, over, the, lazy,, dog", 7, "over")]
        [TestCase(",, The, quick, brown, fox, jumped, over, the, lazy,, dog", 11, "dog")]
        [TestCase(",, The, quick, brown, fox, jumped, over, the, lazy,, dog", 12, null)]
        [TestCase(",, The, quick, brown, fox, jumped, over, the, lazy,, dog", 13, null)]
        public void GetNthWord(string input, int nthWord, string expectedWord)
        {
            ReadOnlySpan<char> s = ",,The,quick,brown,fox,jumped,over,the,lazy,,dog".ToCharArray();
            var word = s.GetNthToken(',', nthWord);
            Assert.AreEqual(word, expectedWord); 
        }

Coding Grounds

Monday, 22 August 2022

Splitting a ReadOnlySpan by a separator

Share this article on LinkedIn.

No comments:

Post a Comment

Followers

Total Pageviews