Are Regexes Fast?
=================

Regexes are certainly convenient. But are they faster than the same Python
code?

>>> import sys
>>> print(sys.version)
3.7.3 (default, Apr 3 2019, 05:39:12)
[GCC 8.3.0]

Consider the following regex, which strips every character that is not a
lowercase ASCII letter:

>>> import re
>>> NOTLOWER = re.compile('[^a-z]')
>>> re_split_lower = lambda s: re.sub(NOTLOWER, '', s)

It is a lot nicer to write this as (nearly) equivalent Python code:

>>> split_lower = lambda s: ''.join(c for c in s if c.islower())

(Nearly equivalent: str.islower() also accepts non-ASCII lowercase letters,
while the regex only matches a-z. For the ASCII strings below the two agree.)

I benchmarked this for the kind of strings I am working with:

>>> from timeit import timeit
>>> timeit('s("AbcCdeFfEqFefqEFE")', globals={'s': re_split_lower})
1.7502129878848791
>>> timeit('s("AbcCdeFfEqFefqEFE")', globals={'s': split_lower})
1.0115699651651084

In this case the regex is significantly slower. What if we make the regex
slightly more complicated? Let's try to split a camel-case string into its
individual words, i.e.:

    helloWorldCruel ===> hello, World, Cruel
    helloWorld      ===> hello, World
    hello           ===> hello

>>> CAMELCASE = re.compile('^[a-z]+|[A-Z][a-z]*')
>>> re_decamelcase = lambda s: re.findall(CAMELCASE, s)

We can write this as a not-quite-equivalent generator (not quite, because it
yields a leading empty string when the input starts with an uppercase letter,
and it splits on anything str.isupper() accepts, not just A-Z):

>>> def decamelcase(s):
...     prev_idx = 0
...     for i in range(len(s)):
...         if s[i].isupper():
...             yield s[prev_idx:i]
...             prev_idx = i
...     yield s[prev_idx:]

For small strings, the regex is still slower:

>>> timeit('t("helloWorldHowAre")', globals={'t': re_decamelcase})
1.349484241567552
>>> timeit('list(t("helloWorldHowAre"))', globals={'t': decamelcase})
1.2007813868112862

But for larger strings, the regex is faster:

>>> timeit('t("hello" + "World"*200)', globals={'t': re_decamelcase})
24.34574169665575
>>> timeit('list(t("hello" + "World"*200))', globals={'t': decamelcase})
63.48653336800635

It would be interesting to see whether this also holds on other VMs such as
PyPy, or whether decamelcase can be rewritten to beat the regex regardless
of input size.
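
One candidate rewrite, as a sketch only: push the per-character loop into C
by using str.translate() to prefix every uppercase ASCII letter with a
delimiter and then splitting on it. The name decamelcase_translate and the
choice of '\x00' as the delimiter are mine, and I have not benchmarked this
against the timings above:

>>> import string
>>> MARK_UPPER = {ord(c): '\x00' + c for c in string.ascii_uppercase}
>>> def decamelcase_translate(s):
...     # translate() and split() both run in C; the one assumption is
...     # that the input never contains the '\x00' delimiter itself
...     return s.translate(MARK_UPPER).split('\x00')
>>> decamelcase_translate('helloWorldCruel')
['hello', 'World', 'Cruel']

Like decamelcase, this yields a leading empty string for inputs that start
with an uppercase letter; unlike decamelcase, it only splits on A-Z.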
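
The same idea might help the first example too. A hypothetical variant,
untimed here, that swaps the generator expression for filter() so the
per-character test runs without a Python-level loop body:

>>> split_lower_filter = lambda s: ''.join(filter(str.islower, s))
>>> split_lower_filter('AbcCdeFfEqFefqEFE')
'bcdefqefq'

Whether either sketch actually beats the regex would need measuring.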