I ran a test at work the other day comparing different ways to optimize regular expressions in C#. Then I wondered how that compares with Rust. My use case was a series of pretty gnarly regexes with backtracking, backreferences, and multiple named groups. I ran ten or so of them sequentially on each input, which might be in the 10-100 KiB range, and in total I had maybe tens or hundreds of thousands of such inputs. So we’re not talking about saving a few milliseconds; the potential performance gains here are nontrivial.

Rather than running the same test for work, I wanted to simulate something that I could post publicly - even if it’s not the same thing. So I started by grabbing the first partial file of the February 20 FR Wikimedia dump, which is 2.1 GiB decompressed, 22.6M lines of XML.

Then, completely arbitrarily, I wrote some regular expressions to detect the <contributor> opening tag, which always seems to be followed by either a <username> or an <ip>, and the IP is either v4 or v6.

      <timestamp>2021-01-14T22:26:27Z</timestamp>
      <contributor>
        <username>Freddo</username>
        <id>72266</id>
      </contributor>
      <minor />
      <comment>/* Fêtes et jours fériés */</comment>

My simple test runs one regex to find the contributor tag, one to check for IP, and on an IP match, runs one to check for IPv4 or another for IPv6; on an IP mismatch, it runs one last regex to check for the username. It’s not what you’d run in “real” code, but it’s an okay proxy.
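For contrast, "real" code would probably not use a regex to tell IPv4 from IPv6 at all. A minimal sketch of that alternative, assuming the standard library's std::net::IpAddr parser is acceptable, might look like the following. The IpKind enum, the classify_ip helper, and the sample addresses are made up for illustration and are not part of the benchmark.

```rust
use std::net::IpAddr;

/// Hypothetical result type for this sketch; not part of the benchmark.
#[derive(Debug, PartialEq)]
enum IpKind {
    V4,
    V6,
}

/// Classifies the text found inside an <ip> element by parsing it,
/// returning None when it is not a valid IP address at all.
fn classify_ip(text: &str) -> Option<IpKind> {
    match text.parse::<IpAddr>() {
        Ok(IpAddr::V4(_)) => Some(IpKind::V4),
        Ok(IpAddr::V6(_)) => Some(IpKind::V6),
        Err(_) => None,
    }
}

fn main() {
    // Made-up sample inputs (documentation address ranges plus a username).
    let samples = ["192.0.2.1", "2001:DB8:0:0:0:0:0:1", "Freddo"];
    for text in samples.iter() {
        println!("{:?} -> {:?}", text, classify_ip(text));
    }
}
```

Parsing is stricter than the loose patterns used in the benchmark, so the two approaches could disagree on malformed input. The benchmark sticks with regexes for every step on purpose, since matching speed is what is being measured.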

The Rust code, using regex = "1.4.3", is:

use regex::Regex;
use std::{
    fs,
    io::{BufRead, BufReader},
};

fn main() {
    let contributor_re = Regex::new("<contributor>").unwrap();
    let username_re = Regex::new("<username>([^<]+)</username>").unwrap();
    let ip_re = Regex::new("<ip>([^<]+)</ip>").unwrap();
    let ipv4_re = Regex::new(r#"^\d+\.\d+\.\d+\.\d+$"#).unwrap();
    let ipv6_re = Regex::new("^[0-9A-F:]+").unwrap();

    let filename = "../frwiki-20210220-pages-articles-multistream1.xml-p1p306134";
    let fh = fs::File::open(filename).unwrap();
    let buf = BufReader::new(fh);

    let mut prev_contributor = false;
    let mut usernames = 0;
    let mut ipv4s = 0;
    let mut ipv6s = 0;

    for line in buf.lines() {
        let line = line.unwrap();
        if prev_contributor {
            if let Some(cap) = ip_re.captures(&line) {
                let ip = cap.get(1).unwrap().as_str();
                // is it ipv4 or ipv6?
                if ipv4_re.captures(ip).is_some() {
                    ipv4s += 1;
                } else if ipv6_re.captures(ip).is_some() {
                    ipv6s += 1;
                }
            } else if username_re.is_match(&line) {
                usernames += 1;
            }
            prev_contributor = false;
        } else {
            prev_contributor = contributor_re.is_match(&line);
        }
    }

    println!(
        "Found {} usernames, {} IPv4 addresses, {} IPv6 addresses",
        usernames, ipv4s, ipv6s
    );
}

When I ran this on a release build, I got my counts and baseline performance time:

$ time ./target/release/regex-rust
Found 154395 usernames, 8162 IPv4 addresses, 3884 IPv6 addresses

real    0m7.737s
user    0m7.163s
sys     0m0.497s

Next, I translated this to C#, specifically a .NET 5 executable running on the same machine:

using System;
using System.IO;
using System.Text.RegularExpressions;

namespace ParseWikipedia
{
    class Program
    {
        static void Main(string[] args)
        {
            Regex contributor_re = new Regex("<contributor>"),
                 username_re = new Regex("<username>([^<]+)</username>"),
                 ip_re = new Regex("<ip>([^<]+)</ip>"),
                 ipv4_re = new Regex(@"^\d+\.\d+\.\d+\.\d+$"),
                 ipv6_re = new Regex("^[0-9A-F:]+");

            string filename = "../frwiki-20210220-pages-articles-multistream1.xml-p1p306134";

            bool prev_contributor = false;
            int usernames = 0,
                ipv4s = 0,
                ipv6s = 0;
            using (StreamReader sr = new StreamReader(filename))
            {
                while (!sr.EndOfStream)
                {
                    string line = sr.ReadLine();
                    if (prev_contributor)
                    {
                        Match m = ip_re.Match(line);
                        if (m.Success)
                        {
                            string ip = m.Groups[1].Value;

                            // is it ipv4 or ipv6?
                            m = ipv4_re.Match(ip);
                            if (m.Success)
                            {
                                ipv4s += 1;
                            }
                            else
                            {
                                m = ipv6_re.Match(ip);
                                if (m.Success)
                                {
                                    ipv6s += 1;
                                }
                            }
                        }
                        else if (username_re.IsMatch(line))
                        {
                            usernames += 1;
                        }
                        prev_contributor = false;
                    }
                    else
                    {
                        prev_contributor = contributor_re.IsMatch(line);
                    }
                }
            }

            Console.WriteLine($"Found {usernames} usernames, {ipv4s} IPv4 addresses, {ipv6s} IPv6 addresses");
        }
    }
}

Note that I’m mixing and matching use of Regex.Match and Regex.IsMatch to be artificially inefficient. But I’m doing the same thing in Rust and in C#. The C# code was slower:

$ time ./bin/Release/net5.0/regex-csharp
Found 154395 usernames, 8162 IPv4 addresses, 3884 IPv6 addresses

real    0m14.621s
user    0m13.552s
sys     0m0.945s

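The C# code ran at roughly half the speed of Rust. That timing covers both the stream reading and the regular expressions, so I also tested versions of both programs that just churn through the data counting lines: Rust’s user time was 5.094s versus C#’s 9.335s. Subtracting those baselines from the earlier user times leaves roughly 2.1s of regex work for Rust versus 4.2s for C#, so the gap of about 2x shows up in both the reading and the matching.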
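The line-counting variants are not shown here. A minimal sketch of what the Rust one might look like, assuming it keeps the same BufReader and lines() loop and simply drops the regex work, is below. The count_lines helper and the command-line argument handling are illustrative assumptions, not the code that was actually timed.

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Counts lines with the same `lines()` iteration as the regex benchmark,
/// so each line is still decoded and allocated as a String.
fn count_lines<R: BufRead>(reader: R) -> usize {
    let mut count = 0;
    for line in reader.lines() {
        // Mirrors the benchmark: an I/O or UTF-8 error panics.
        let _line = line.unwrap();
        count += 1;
    }
    count
}

fn main() {
    // Hypothetical usage: pass the dump path as the first argument.
    match std::env::args().nth(1) {
        Some(path) => {
            let file = File::open(&path).unwrap();
            println!("Found {} lines", count_lines(BufReader::new(file)));
        }
        None => eprintln!("usage: count-lines <path-to-dump>"),
    }
}
```

Keeping the per-line String allocation in the baseline matters: it means the difference between this and the full program is attributable to the regex calls rather than to a cheaper way of reading the file.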

One last test was to add lto = "fat" and codegen-units = 1 to the Cargo.toml. This sped up Rust by about 15%:

$ time ./target/release/regex-rust
Found 154395 usernames, 8162 IPv4 addresses, 3884 IPv6 addresses

real    0m6.552s
user    0m6.092s
sys     0m0.460s
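For reference, both settings are Cargo profile options. Assuming they were applied to the release profile (the build being timed), the addition to Cargo.toml would look roughly like this:

```toml
[profile.release]
lto = "fat"
codegen-units = 1
```

Both trade longer compile times for more optimization opportunities across the whole program, including the regex crate.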

While this isn’t quite a scientifically rigorous test, I’m still impressed by how easily this simple workload can be sped up by 2x.