Sunday, 18 April 2021

Implementing a Strip method with regex in C#

This article presents a Strip method that accepts a Regex defining the pattern of allowed characters. It is similar to Regex.Replace, but works the other way around: instead of removing the characters that match the pattern, as Regex.Replace does when given an empty replacement, this utility method lets you define the characters you want to keep. First off, we define the utility method as an extension method.
 

        // Requires: using System.Collections.Generic; using System.Linq;
        // using System.Text.RegularExpressions;

        /// <summary>
        /// Strips away every character not matched by the provided regex <paramref name="allowedChars"/>
        /// </summary>
        /// <param name="s">Input string</param>
        /// <param name="allowedChars">The allowed characters defined in a Regex pattern, for example: [A-Za-z0-9]+</param>
        /// <returns>Input string with only the allowed characters</returns>
        public static string Strip(this string s, Regex allowedChars)
        {
            if (s == null)
            {
                return s;
            }
            if (allowedChars == null)
            {
                return string.Empty;
            }
            // Match with the Regex instance itself so that its RegexOptions are honored.
            Match match = allowedChars.Match(s);
            List<char> allowedAlphabet = new List<char>();
            while (match.Success)
            {
                // Collect the characters of every group in this match (group 0 is the whole match).
                for (int i = 0; i < match.Groups.Count; i++)
                {
                    allowedAlphabet.AddRange(match.Groups[i].Value.ToCharArray());
                }
                match = match.NextMatch();
            }
            // Keep only the input characters that occurred in some match.
            return new string(s.Where(ch => allowedAlphabet.Contains(ch)).ToArray());
        }
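
The filter at the end calls Contains on a List<char>, which is a linear scan for every input character, and the list can hold many duplicates. As a minimal sketch of how the same idea might scale better, the variant below collects the alphabet in a HashSet<char>. The names StringStripExtensions and StripWithSet are hypothetical and only used for illustration; the sketch also calls Matches on the Regex instance, so any RegexOptions set on it apply.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StringStripExtensions
{
    // Hypothetical variant of Strip: the allowed alphabet is a HashSet<char>,
    // so duplicates are dropped and each lookup is O(1) on average.
    public static string StripWithSet(this string s, Regex allowedChars)
    {
        if (s == null)
        {
            return s;
        }
        if (allowedChars == null)
        {
            return string.Empty;
        }

        var allowedAlphabet = new HashSet<char>();
        foreach (Match match in allowedChars.Matches(s))
        {
            // Group 0 is the whole match; further groups are the captures.
            foreach (Group group in match.Groups)
            {
                allowedAlphabet.UnionWith(group.Value);
            }
        }

        // Keep only the input characters that occurred in some match.
        return new string(s.Where(ch => allowedAlphabet.Contains(ch)).ToArray());
    }
}
```

For short inputs the difference is unlikely to matter; the set mainly helps when the input is long and the alphabet is large.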
        
          
Here are some tests that exercise this Strip method:
 
 
        [Test]
        [TestCase("abc123abc", "[A-Za-z]+", "abcabc")]
        [TestCase("abc123def456", "[0-9]+", "123456")]
        [TestCase("The F-35 Lightning II is a newer generation fighter jet than the F-16 Fighting Falcon", "[0-9]+", "3516")]
        [TestCase("Here are some Norwegian letters : ÆØÅ and in lowercase: æøå", "[æøå]", "æøå")]
        public void TestStripWithRegex(string input, string regexString, string expectedOutput)
        {
            var regex = new Regex(regexString);
            input.Strip(regex).Should().Be(expectedOutput);
        }
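
One thing to be aware of: Strip builds an alphabet from the matches and then filters the whole input against it. For patterns that are a single character class, like the ones in the tests, this makes no difference. For a multi-character pattern such as F-[0-9]+, however, every F in the input is kept, including those outside a match. As a rough sketch for comparison, here is a hypothetical helper (KeepMatches, not part of the original code) that concatenates only the matched text.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class KeepMatchesExtensions
{
    // Hypothetical helper: keeps only the text covered by a match,
    // in the order the matches occur in the input.
    public static string KeepMatches(this string s, Regex pattern)
    {
        if (s == null)
        {
            return null;
        }
        if (pattern == null)
        {
            return string.Empty;
        }
        return string.Concat(pattern.Matches(s).Cast<Match>().Select(m => m.Value));
    }
}
```

With the input "The F-16 Fighting Falcon" and the pattern F-[0-9]+, KeepMatches returns "F-16", whereas Strip returns "F-16FF". When the allowed set is a plain character class, the built-in Regex.Replace with a negated class is another option: Regex.Replace(input, "[^0-9]", "") returns "16" for the same input.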
 
 

1 comment:

  1. Pro tip: Try IndexOf >= 0 instead of Contains here. You can also apply Distinct before AddRange. That should scale better.
