(almost) no tech hacking

hacking your way through life with common sense and a little bit of python

Discovering Regular Expressions prone to Denial of Service attacks

I recently came across some article from Bryan Sullivan describing a new denial of service vector targeting regex-based input validation filters. The concept is simple: some regex patterns processing time becomes O(n^2) when confronted with strings that do not match.

http://www.flickr.com/photos/anavrin/194771480/sizes/z/in/photostream/

In this situation, one can submit to the validation engine some specially crafted invalid string and consume huge amounts of machine processing time, at least until external timeouts (such as web server response generation times) kill the busy process.

Take a look at the following regex pattern:

^(\w+\d+)+a$

Innocent, right? Well, if you try to confront it with the following script:

#!/usr/bin/python
import re
import sys
s = re.compile(sys.argv[1])
a = s.match(sys.argv[2])

and let’s see how well it behaves under some specially crafted input:

$ time ./test_regex.py “(\w+\d+)+a” “999999999999999999999.”

real 0m0.304s

user 0m0.290s

sys 0m0.020s

$ time ./test_regex.py “(\w+\d+)+a” “9999999999999999999999.”

real 0m0.576s

user 0m0.560s

sys 0m0.010s

$ time ./test_regex.py “(\w+\d+)+a” “99999999999999999999999.”

real 0m1.108s

user 0m1.100s

sys 0m0.010s


As you can see, due to the backtracing behaviour of the Python regex engine (of Non-Deterministic Finite Automation type) each additional character we add to our invalid expression effectively doubles the processing time. As Bryan explains in his article, when the engine handles a string which does not show itself as invalid until its very end, processing time of backtracing for self-repeating capture groups grows exponentially.


So, any regex pattern containing self-repeating groups, or group repetition of overlapping subexpressions can be prone to suffer from denial of service issues under certain inputs. Considering regex patterns usually are complex to decode at a glance and tend to become some kind of ciphered text for the naked-eye, it is not trivial to identify these dangerous structures on any given regex.

This is where Python, once again, comes to our rescue.

Python Regular Expression module provides one nifty, yet usually unknown feature to debug our patterns printing the parse tree. By passing the undocumented, experimental, hidden flag re.DEBUG (actually, 128) to re.compile we obtain something like this:

>>> import re
>>> re.compile('(\w+\d+)+a', re.DEBUG)
max_repeat 1 65535
  subpattern 1
    max_repeat 1 65535
      in
        category category_word
    max_repeat 1 65535
      in
        category category_digit
literal 97
<_sre.SRE_Pattern object at 0x7f72c27287b0>

How can we use this feature to find dangerous structures inside our regex filters? Well, if you look closely at the previous output, you can see that there are two max_repeat items nested inside another max_repeat occurrence. If we treat these items as programatic loops, nested constructions can give away problems like the exponential backtracing we discussed earlier.

Consider the following regex used to validate email addresses:

^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@(([0-9a-zA-Z])+([-\w]*[0-9a-zA-Z])*.)+[a-zA-Z]{2,9})$

Daunting at first glance. But let’s shove it to our newly discovered Python regex parser:

$ python -c "import re; re.compile('^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@(([0-9a-zA-Z])+([-\w]*[0-9a-zA-Z])*\.)+[a-zA-Z]{2,9})$', re.DEBUG)"
at at_beginning
subpattern 1
  in
    range (48, 57)
    range (97, 122)
    range (65, 90)
  max_repeat 0 65535
    subpattern 2
      max_repeat 0 65535
        in
          literal 45
          literal 46
          category category_word
      in
        range (48, 57)
        range (97, 122)
        range (65, 90)
  literal 64
  max_repeat 1 65535
    subpattern 3
      max_repeat 1 65535
        subpattern 4
          in
            range (48, 57)
            range (97, 122)
            range (65, 90)
      max_repeat 0 65535
        subpattern 5
          max_repeat 0 65535
            in
              literal 45
              category category_word
          in
            range (48, 57)
            range (97, 122)
            range (65, 90)
      literal 46
  max_repeat 2 9
    in
      range (97, 122)
      range (65, 90)
at at_end


Pay attention to the items highlighted. We have two nested max_repeat constructions with an overlapping range of characters (range 48-57) inside another max_repeat construction. This is a clear example of another vulnerable regex pattern, but it wasn’t so clear when we looked at the raw regex was it?

As I said before, regex based denial of service vulnerabilities are usually part of scripts operating within the context of another application which may limit their processing time, and thus effectively avoiding the regex validation to last forever. However, when coupled with other attack vectors such as a simple request parallelization, the impact on application performance can become important and should be taken into consideration.

In the end, this is just another vector to use in the so called Application DoS Attacks.

  1. olemoudi posted this