Linux – HAProxy is giving me problems with regex replace. Is this a bug, or am I doing something incorrect?


I am attempting to correct a URL parameter issue by forcing a URL encode on one node of a POST path, an issue that occurs fairly frequently. It seems best, for now, to fix this at the proxy layer until a better solution has been developed, but HAProxy is giving me problems with this. I should also mention that I am stuck with HAProxy v1.5 at the moment (which, from what I can tell, also rules out Lua, since that was introduced in v1.6?).

An example goes like this.

I typically get a POST request in a form like this:

http(s)://sub.domain.com/context/{context}/staticPath/location/{location}/material/{material} 

So, it may look more like this in practice..

http://sub.domain.com/context/smith/staticPath/location/columbus/material/abc/123

I need the following to come out the other end:

http://sub.domain.com/context/smith/staticPath/location/columbus/material/abc%2F123

The problem is that abc/123 is a single material, but the '/' inside it changes the actual path, so it needs to be sent as "abc%2F123".
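For reference, the target form is ordinary percent-encoding of the material segment. A minimal Python sketch of that transformation follows; Python is used only for illustration, and encode_material is a hypothetical helper name, not anything HAProxy provides.

```python
from urllib.parse import quote


def encode_material(material: str) -> str:
    """Percent-encode a single path segment, including any '/' inside it."""
    # safe="" is required here: by default quote() leaves "/" unencoded.
    return quote(material, safe="")


print(encode_material("abc/123"))  # abc%2F123
```

This is the output the proxy rule has to reproduce for the last segment of the path.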

I am attempting to catch this in the proxy, I can get regex to capture what I need but it seems that whenever I try to have a "slash" '/' in a capture group and/or attempt to put a slash back into the replace section it breaks the rewrite.

Here are examples of what I have tried. Keep in mind that I intend to expand the capturing to grab the entire URL, but I simplified it while trying to work these out. I am also telling some of the story from memory at this point, so please forgive me if the examples below are not dead on. I have tried many, many combinations trying to come up with a strategy that would work.

In this way..

reqrep (\w+\s?)\/(material)\/(\w+\s?)\/(.*) \1\2\3%2f\4

I can get the capture groups to put the URL back together again, but without the path delimiters ("/") between the path nodes.

Like this, it does not replace, it will just send the original path.

reqrep (\w+\s?)\/(material)\/(\w+\s?)\/(.*) \1\/\2\/\3%2f\4

Taking a strategy like this…

reqrep (\w+\s?)(\/)(material)(\/)(\w+\s?)\/(.*) \1\2\3\4\5%2f\6

One other strategy that I tried is keeping the "/"s inside the capture groups so that they come out in the replace, leaving the undesired slash outside any capture group, similar to below:

reqrep (\w+\s?)(\/material\/)(\w+\s?)\/(.*) \1\2\3%2f\4

I have also read about, and seen examples where, some of the regex has spaces and the replace has some spacing. I can get close by using some spacing in the replace, but that leaves undesired spaces in the end result.

Also, if I escape a space and then add a slash, it seems to get closer, e.g. \1\ /\2, but then I get something like "location /material", with the extra space mentioned above.

The pattern I am noticing is that whenever I add slashes to a capture group in the regex, it messes up the replace, which leaves me guessing wildly: are the slashes not getting escaped because they are in the capture group? And why can't I just put them back in the replace as literals? This is the point where I imagine I may have stumbled upon a bug, but I am also aware that I may be screwing this up. A solution has been developed using Nginx, but standing up an instance of that in front of what we need is not the most practical option if I can get HAProxy to do this, mostly because we are already using HAProxy for quite a bit of other stuff.

I would honestly prefer to address this issue in another way, but for now using the proxy seems to be one of my best choices. I also don't have the luxury of forcing the originator to give me better paths.

Best Answer

This suggestion I made in comments appears to do almost the right thing:

reqrep ^([^\ :]+)(\ ?/.+/material/)(.+)/(.+)(\ .+)$ \1\2\3\4%2f\5

In fact, I put \4 on the wrong side of %2f. I also incorrectly made the space at the beginning of the second capture group optional, which doesn't break the regex but is not technically correct.

This is the correct form:

reqrep ^([^\ :]+)(\ /.+/material/)(.+)/(.+)(\ .+)$ \1\2\3%2f\4\5

That's the problem with reqrep -- you're tweaking the first line of the HTTP request, directly. Powerful, but tedious.

Breaking this down:

^ Always anchor your pattern to the beginning of the line.

([^\ :]+) This is the HTTP verb (GET, POST, etc.). It must contain no spaces, no colon. This is capture group 1.

(\ /.+/material/) The verb must be followed by a space, the leading slash (forward slashes do not need a backslash escape in HAProxy regexes), one or more characters, then /material/ ... this is capture group 2.

(.+) The first part of what we want to split at a / is capture group 3... and really, this would be more correctly written ([^/]+) though most potential mismatches are prevented by the space we require in group 5, below.

/ the slash we want to eliminate

(.+) The portion of the URL after the / is capture group 4

(\ .+) a space, followed by 1 or more characters, which is going to capture HTTP/1.x at the end of the request line as capture group 5.

$ anchored to the end of the line.

Then put them all back together.

\1\2\3%2f\4\5
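
One way to sanity-check the pattern before loading it into HAProxy is to run it against a sample request line. The sketch below uses Python's re module, which is an assumption: HAProxy 1.5 uses its own regex library (POSIX or PCRE, depending on the build), so edge cases may behave differently. The sample request lines and the rewrite_request_line helper are made up for illustration.

```python
import re

# Same pattern and replacement as the reqrep rule above. The "\ " escapes are
# there for HAProxy's config parser; in Python they are harmless.
PATTERN = re.compile(r"^([^\ :]+)(\ /.+/material/)(.+)/(.+)(\ .+)$")
REPLACEMENT = r"\1\2\3%2f\4\5"


def rewrite_request_line(line: str) -> str:
    """Return the request line with the slash inside the material encoded."""
    return PATTERN.sub(REPLACEMENT, line)


sample = "POST /context/smith/staticPath/location/columbus/material/abc/123 HTTP/1.1"
print(rewrite_request_line(sample))
# POST /context/smith/staticPath/location/columbus/material/abc%2f123 HTTP/1.1

# A request line with no extra slash after /material/ does not match
# and is passed through unchanged.
print(rewrite_request_line("POST /context/smith/material/abc HTTP/1.1"))
```

Under these assumptions, group 5 is what keeps the slash in HTTP/1.1 from being picked as the split point, as described in the breakdown.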


HAProxy 1.6 handles this more elegantly with the built-in Lua interpreter as well as a converter called regsub() (though it is very simple -- substitutions only, no capture groups, but it's good for splitting strings) and user-defined variables where you can "stash" little data nuggets while processing the request. It also allows you to use http-request set-path and has a path fetch to read and write the path in isolation from the rest of the URL and without tweaking the HTTP request buffer directly with a regex. Most or all of these things are not in 1.5.