Bash – Regex negative matching trouble

bash regex

I've been frustrated trying to come up with a regex to match strings based on specific file names and am hoping there's a regex ninja (I'll omit the obligatory xkcd link for the sake of time) out there who can help.

I need to match any string ending with ".htm" or ".html" that is NOT (negative matching) immediately preceded by "msg-" followed by 4-16 characters that are each a digit or a hyphen. The start of the string can be any length or content.

Here's my attempt so far:

(?!msg-[0-9-]{4,16})\.html?$

However, this doesn't seem to work. Part of the problem is the lookahead: I want to match the whole string if it meets these criteria, not just the part of the string that the lookahead lets through. Any suggestions would be appreciated.

In case it matters for flavors, this is going into a bash script on Debian.

EDIT:

Here are some strings that should match the regex

the-quick-brown-fox-jumped-over-the-lazy-dog.html  # ends with .html but no digits/hyphens just prior
wdihwi94uq239ujdf23yefh02msg-2-8.htm   # digit/hyphen count between 'msg-' and '.htm' is below 4
ohdf23890yo4c89uwmsg-999-24j345.html   # non-number/hyphen in chars between 'msg-' and '.html'

Here are some strings that should NOT match the regex:

kh3j42he2-dwfascn233=feufefask0msg-34535-355  # does not end with '.htm'/'.html'
395-u78{efihighqwioh9msg-8455-212.html  # ends with 'msg-' then 4-16 of [0-9-] then '.html'
dfhjwih9asnm)qpzmx.wod923klsj39msg-00-0000.htm

Best Answer

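I think the following Perl regexp matches what you want. Note the leading ^: without it, the engine can retry from a later position in the string (for example just past the "m" of "msg-"), where the lookahead no longer sees "msg-", and still find a match:

^(?!.*msg-[-0-9]{4,16}\.html?$).*\.html?$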

However, AFAIK bash itself does not support Perl regexps anywhere. The =~ operator only supports extended regexps¹, which don't include zero-width lookahead assertions such as (?=…) and (?!…).
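If calling an external tool is acceptable, the lookahead pattern can be handed to one that does speak PCRE. The following is only a sketch, assuming GNU grep built with PCRE support (the -P option is a GNU extension, not part of POSIX grep); matches_pcre is a hypothetical helper name, not something from the question.

```shell
#!/usr/bin/env bash
# Sketch: delegate the lookahead pattern to GNU grep's PCRE mode (-P).
# Assumption: grep is GNU grep with PCRE support; matches_pcre is a made-up name.
matches_pcre() {
    # -q suppresses output, so only the exit status is used.
    # The leading ^ pins the lookahead to the start of the string, so the
    # engine cannot retry the match from a position past "msg-".
    printf '%s\n' "$1" | grep -qP '^(?!.*msg-[-0-9]{4,16}\.html?$).*\.html?$'
}

# Two of the sample strings from the question.
matches_pcre 'the-quick-brown-fox-jumped-over-the-lazy-dog.html' && echo 'match'
matches_pcre '395-u78{efihighqwioh9msg-8455-212.html' || echo 'no match'
```

This costs one extra process per string, which may matter in a tight loop.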

It is theoretically possible to convert a regexp with lookahead assertions to one without, but the resulting regexp would be huge. It is much simpler to use two regexps:

[[ $string =~ \.html?$ && ! $string =~ msg-[-0-9]{4,16}\.html?$ ]]
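To check the two-regexp test against the sample strings from the question, it can be wrapped in a function. This is a minimal sketch: wanted_file and the two variable names are hypothetical, and the patterns are the ones shown above, kept in variables so that they need no special quoting inside [[ ]].

```shell
#!/usr/bin/env bash
# Sketch of the two-regexp test as a reusable function.
# wanted_file, ends_html and msg_html are illustrative names only.
wanted_file() {
    local string=$1
    local ends_html='\.html?$'                  # ends in .htm or .html
    local msg_html='msg-[-0-9]{4,16}\.html?$'   # the ending to reject
    # An unquoted variable on the right of =~ is treated as an ERE.
    [[ $string =~ $ends_html && ! $string =~ $msg_html ]]
}

# Sample strings taken from the question.
for s in \
    'the-quick-brown-fox-jumped-over-the-lazy-dog.html' \
    'wdihwi94uq239ujdf23yefh02msg-2-8.htm' \
    'ohdf23890yo4c89uwmsg-999-24j345.html' \
    'kh3j42he2-dwfascn233=feufefask0msg-34535-355' \
    '395-u78{efihighqwioh9msg-8455-212.html' \
    'dfhjwih9asnm)qpzmx.wod923klsj39msg-00-0000.htm'
do
    if wanted_file "$s"; then
        printf 'match:    %s\n' "$s"
    else
        printf 'no match: %s\n' "$s"
    fi
done
```

The first three strings are reported as matches and the last three as non-matches, which agrees with the lists in the question.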

¹ First there were basic regexps (BRE) (with several syntax variants), then came extended regexps (ERE) with more features (and again several syntax variants). Perl added yet more features, and many languages provide Perl-compatible regexps (PCRE). But bash sticks to ERE.