Discovering Regular Expressions prone to Denial of Service attacks
I recently came across some article from Bryan Sullivan describing a new denial of service vector targeting regex-based input validation filters. The concept is simple: some regex patterns processing time becomes O(n^2) when confronted with strings that do not match.

In this situation, one can submit to the validation engine some specially crafted invalid string and consume huge amounts of machine processing time, at least until external timeouts (such as web server response generation times) kill the busy process.
Take a look at the following regex pattern:
^(\w+\d+)+a$
Innocent, right? Well, if you try to confront it with the following script:
#!/usr/bin/python import re import sys s = re.compile(sys.argv[1]) a = s.match(sys.argv[2])
and let’s see how well it behaves under some specially crafted input:
$ time ./test_regex.py “(\w+\d+)+a” “999999999999999999999.”
real 0m0.304s
user 0m0.290s
sys 0m0.020s
$ time ./test_regex.py “(\w+\d+)+a” “9999999999999999999999.”
real 0m0.576s
user 0m0.560s
sys 0m0.010s
$ time ./test_regex.py “(\w+\d+)+a” “99999999999999999999999.”
real 0m1.108s
user 0m1.100s
sys 0m0.010s
As you can see, due to the backtracing behaviour of the Python regex engine (of Non-Deterministic Finite Automation type) each additional character we add to our invalid expression effectively doubles the processing time. As Bryan explains in his article, when the engine handles a string which does not show itself as invalid until its very end, processing time of backtracing for self-repeating capture groups grows exponentially.
So, any regex pattern containing self-repeating groups, or group repetition of overlapping subexpressions can be prone to suffer from denial of service issues under certain inputs. Considering regex patterns usually are complex to decode at a glance and tend to become some kind of ciphered text for the naked-eye, it is not trivial to identify these dangerous structures on any given regex.
This is where Python, once again, comes to our rescue.
Python Regular Expression module provides one nifty, yet usually unknown feature to debug our patterns printing the parse tree. By passing the undocumented, experimental, hidden flag re.DEBUG (actually, 128) to re.compile we obtain something like this:
>>> import re
>>> re.compile('(\w+\d+)+a', re.DEBUG)
max_repeat 1 65535
subpattern 1
max_repeat 1 65535
in
category category_word
max_repeat 1 65535
in
category category_digit
literal 97
<_sre.SRE_Pattern object at 0x7f72c27287b0>
How can we use this feature to find dangerous structures inside our regex filters? Well, if you look closely at the previous output, you can see that there are two max_repeat items nested inside another max_repeat occurrence. If we treat these items as programatic loops, nested constructions can give away problems like the exponential backtracing we discussed earlier.
Consider the following regex used to validate email addresses:
^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@(([0-9a-zA-Z])+([-\w]*[0-9a-zA-Z])*.)+[a-zA-Z]{2,9})$
Daunting at first glance. But let’s shove it to our newly discovered Python regex parser:
$ python -c "import re; re.compile('^([0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*@(([0-9a-zA-Z])+([-\w]*[0-9a-zA-Z])*\.)+[a-zA-Z]{2,9})$', re.DEBUG)"
at at_beginning
subpattern 1
in
range (48, 57)
range (97, 122)
range (65, 90)
max_repeat 0 65535
subpattern 2
max_repeat 0 65535
in
literal 45
literal 46
category category_word
in
range (48, 57)
range (97, 122)
range (65, 90)
literal 64
max_repeat 1 65535
subpattern 3
max_repeat 1 65535
subpattern 4
in
range (48, 57)
range (97, 122)
range (65, 90)
max_repeat 0 65535
subpattern 5
max_repeat 0 65535
in
literal 45
category category_word
in
range (48, 57)
range (97, 122)
range (65, 90)
literal 46
max_repeat 2 9
in
range (97, 122)
range (65, 90)
at at_end
Pay attention to the items highlighted. We have two nested max_repeat constructions with an overlapping range of characters (range 48-57) inside another max_repeat construction. This is a clear example of another vulnerable regex pattern, but it wasn’t so clear when we looked at the raw regex was it?
As I said before, regex based denial of service vulnerabilities are usually part of scripts operating within the context of another application which may limit their processing time, and thus effectively avoiding the regex validation to last forever. However, when coupled with other attack vectors such as a simple request parallelization, the impact on application performance can become important and should be taken into consideration.
In the end, this is just another vector to use in the so called Application DoS Attacks.
-
lesitedupharmacien likes this
-
olemoudi posted this