Httpd – how to rewrite ‘%25’ in url

Tags: httpd, httpd.conf, mod-rewrite

My website software replaces space characters with '+' characters in the URL. A proper link would look like 'http://www.schirmacher.de/display/INFO/How+to+reattach+a+disk+to+XenServer', for example.

Some websites link to that article but somehow their embedded editor can't handle the encoding, so what I see in the httpd log files is actually

GET /display/INFO/How%2525252bto%2525252breattach%2525252ba%2525252bdisk%2525252bto%2525252bXenServer

which of course leads to a 404 error. It seems that the '+' character is encoded as '%2b' and then the '%' character is encoded as '%25' – several times.
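To make the layering concrete, here is a small Python sketch of how a '+' could end up as '%2525252b'. It assumes each broken editor simply percent-encodes the whole string again; `encode_rounds` is a hypothetical helper, not anything the linking sites are known to run.

```python
from urllib.parse import quote


def encode_rounds(text, rounds):
    """Percent-encode `text` `rounds` times in a row."""
    for _ in range(rounds):
        # safe="" forces reserved characters such as '+' and '%' to be encoded
        text = quote(text, safe="")
    return text


for n in range(1, 5):
    # quote() emits upper-case hex digits; the log shows lower case
    print(n, encode_rounds("+", n).lower())
# 1 %2b
# 2 %252b
# 3 %25252b
# 4 %2525252b
```

Under that assumption, the sequence in the log corresponds to four encoding passes: one that turns '+' into '%2b' and three more that each turn '%' into '%25'.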

Since there are many such references to different pages from different websites, I would like to rewrite the url so that the visitors get the correct page.

Here's my attempt which does not work:

RewriteRule ^(.*)%25(.*)$ $1%$2 [R=301]

What it is supposed to do is: take everything before the %25 string and everything after it, concat those strings with a '%' in between, then redirect.

With the example input URL the rule should rewrite to

/display/INFO/How%25252bto%2525252breattach%2525252ba%2525252bdisk%2525252bto%2525252bXenServer

followed by a redirect, then it should rewrite to

/display/INFO/How%252bto%2525252breattach%2525252ba%2525252bdisk%2525252bto%2525252bXenServer

and again to

/display/INFO/How%2bto%2525252breattach%2525252ba%2525252bdisk%2525252bto%2525252bXenServer

and so on. Finally, after a lot of redirects I should have left

/display/INFO/How%2bto%2breattach%2ba%2bdisk%2bto%2bXenServer

which is a valid url equivalent to /display/INFO/How+to+reattach+a+disk+to+XenServer.
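The intended step-by-step behaviour can be simulated outside httpd. The sketch below uses Python's `re` module as a stand-in for the rule's regular expression and works on the raw string from the log; it does not model what mod_rewrite actually receives. `apply_until_stable` is a hypothetical name.

```python
import re

# Same pattern as the RewriteRule: greedy (.*) on both sides of a literal %25
RULE = re.compile(r"^(.*)%25(.*)$")


def apply_until_stable(path, limit=100):
    """Replace one %25 with % per step until the pattern no longer matches."""
    steps = 0
    while steps < limit:
        match = RULE.match(path)
        if match is None:
            break
        path = match.group(1) + "%" + match.group(2)
        steps += 1
    return path, steps


raw = ("/display/INFO/How%2525252bto%2525252breattach%2525252ba"
       "%2525252bdisk%2525252bto%2525252bXenServer")
final, steps = apply_until_stable(raw)
print(final)  # /display/INFO/How%2bto%2breattach%2ba%2bdisk%2bto%2bXenServer
print(steps)  # 18
```

Two things stand out. Because the first `(.*)` is greedy, each step shortens the last remaining '%25', not the first, although the end result is the same. And the example URL needs 18 steps (six sequences with three '25's each), so one redirect per step would be far too many.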

My problem is that the expression does not match at all, so it does not even replace a single occurrence of %25.

I understand that there is a limit on the number of redirects and that I should really use the [N] flag; however, I don't even get the first step right.


@Ben Lee: thanks for your detailed answer. I have now spent several hours on that problem. Here's what I have found out:

  1. Any '%25' string in the url is converted to '%' before mod_rewrite
    sees it. So the RewriteRule ^(.*)%25(.*)$ does not match '%25' in
    the url, it actually matches '%2525'.

  2. The presence of a backslash before the '%' does not make a difference.
    It seems that the '%' sign is not interpreted as a backreference in my case,
    perhaps because there is no RewriteCond statement before. But it is probably
    better to use the backslash anyway, just to be sure.

  3. The line having [L,R=301] is incorrect. It will attempt to redirect for every %2b match but there is a limit of allowed redirects and it will fail if there are more.
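Finding 1 can be illustrated with a short Python sketch. It assumes the single decoding pass done by the server behaves like `urllib.parse.unquote`; `rule_would_match` is a hypothetical helper that only checks for a literal '%25' in the decoded path.

```python
from urllib.parse import unquote


def rule_would_match(requested_path):
    """True if a literal '%25' survives one decoding pass."""
    seen_by_rule = unquote(requested_path)  # one pass only, not repeated
    return "%25" in seen_by_rule


print(unquote("/display/INFO/How%2525252bto"))  # /display/INFO/How%25252bto
print(rule_would_match("/a%25b"))    # False: decoded path is /a%b
print(rule_would_match("/a%2525b"))  # True: decoded path is /a%25b
```

In this model every request loses one level of encoding before any rule runs, which is why a pattern containing '%25' only matches when the request contained '%2525'.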

Here are the mod_rewrite lines I am using:

RewriteRule ^(.*)\%25(.*\%25.*)$ $1%$2 [N]
RewriteRule ^(.*)\%25(.*)$ $1%$2 [R=301,L]

RewriteRule ^(.*)\%2b(.*\%2b.*)$ $1+$2 [N]
RewriteRule ^(.*)\%2b(.*)$ $1+$2 [R=301,L]

The third line will replace all but one %2b sequences with a '+' character. When there is only one %2b sequence left, the fourth line will match, forcing a redirect.

The first and second lines are basically the same but for the %25 sequence. It is necessary to have a rule with an [R] flag for each possible character sequence because I am also using mod_proxy / mod_jk and the redirect will make sure that the resulting url is fed to each module again. Otherwise httpd would attempt to fetch the url from disk, which would fail in my case.
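The interplay of the third and fourth lines can be sketched in Python. This is only a simulation under the assumption that Python's `re` matches these patterns the same way httpd does; it covers the %2b pair only, and `rewrite_plus` is a hypothetical name.

```python
import re

# Third line: matches only while at least two %2b sequences remain ([N] loop)
MANY = re.compile(r"^(.*)%2b(.*%2b.*)$")
# Fourth line: matches the single remaining %2b and ends with a redirect
LAST = re.compile(r"^(.*)%2b(.*)$")


def rewrite_plus(path):
    """Return (new_path, internal_loops, redirected)."""
    loops = 0
    while True:
        match = MANY.match(path)
        if match is None:
            break
        path = match.group(1) + "+" + match.group(2)
        loops += 1  # [N] restarts rule processing without a redirect
    match = LAST.match(path)
    if match is None:
        return path, loops, False
    return match.group(1) + "+" + match.group(2), loops, True


print(rewrite_plus("/display/INFO/How%2bto%2breattach%2ba%2bdisk%2bto%2bXenServer"))
# ('/display/INFO/How+to+reattach+a+disk+to+XenServer', 5, True)
```

For the example path, five replacements happen internally and only the sixth one issues a redirect, so the redirect limit is not a concern.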

Best Answer

Here's your original rule, with [L] added to denote "last":

RewriteRule ^(.*)%25(.*)$ $1%$2 [L,R=301]

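Beyond that, there are a few problems. First, percent signs have a special meaning in mod_rewrite: in a substitution string, %N is a back reference to a group matched by a RewriteCond. The pattern itself is an ordinary regular expression in which % is a literal character, so escaping it with a backslash is not strictly required, but it is harmless and makes the intent explicit: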

RewriteRule ^(.*)\%25(.*)$ $1%$2 [L,R=301]

Second, when you insert a % into the replacement, it is not then treated as part of a URI-encoded sequence; it translates to a literal percent sign. In the original url you are receiving, the first %25 is converted into a literal percent sign as well. So the above rule will leave literal %25s or a literal %2b in the url instead of resolving them to % or +, and you have to resolve these yourself:

RewriteRule ^(.*)\%25(.*)$ $1%$2
RewriteRule ^(.*)\%2b(.*)$ $1+$2 [L,R=301]

Finally, since you don't just have a single 25 after the initial %, but potentially many, use [N] to denote "next". This basically means "start the process over from the beginning, but use my new url as the input". So this will deal with any number of 25s after the percent:

RewriteRule ^(.*)\%25(.*)$ $1%$2 [N]
RewriteRule ^(.*)\%2b(.*)$ $1+$2 [L,R=301]

Note: This should work if you are setting up your rule in the regular Apache configs. If you are setting it up in an .htaccess file, leading slashes are omitted from the string checked against the regex, in which case you have to add them back in yourself:

RewriteRule ^(.*)\%25(.*)$ /$1%$2 [N]
RewriteRule ^(.*)\%2b(.*)$ /$1+$2 [L,R=301]

UPDATE: I don't have the ability to test right now, but looking at the docs, I just saw an option NE for "no escape" that makes percents work as regular encoding markers in the result. If I understand correctly, that means the rule can be simplified to this:

RewriteRule ^(.*)\%25(.*)$ $1%$2 [NE,N,L,R=301]

But again, this is untested, and I've never actually used the NE flag so I may be misunderstanding it. If you test this and find that it works, let me know and I'll remove this UPDATE and just fix the above answer to include this simpler version.
