R – Why would Textpad ask if you want to use POSIX regular expression syntax

posix regex textpad windows

I need to separate out a bunch of image urls from a document in which the images are associated with names like this:

bellpepper = "http://images.com/bellpepper.jpg"
cabbage = "http://images.com/cabbage.jpg"
lettuce = "http://images.com/lettuce.jpg"
pumpkin = "http://images.com/pumpkin.jpg"

I want to remove all text except the URLs from the file by deleting the variable name, equals sign and double quotes so I have a new file that is just a list of URLs, one per line.

I've tried various ways of identifying the non-URL data using regular expressions in Textpad by checking the "Regular expression" checkbox in the Find dialog window but Textpad doesn't seem to like any of them.

Under

Configure->Preferences->Editor

there's an option:

"Use POSIX regular expression syntax"

As opposed to what?

Is it possible that my problems performing this regex operation have to do with some quirk of Textpad's implementation of regex?

Best Answer

The POSIX option is an alternative to TextPad's default syntax. From the Search/Replace help doc:

TextPad's regular expressions are based on POSIX standard P1003.2, but the syntax can be that of POSIX, or UNIX extended regular expressions (the default).

To get the job done in TextPad, use the following:

Find in: ^[^"]*"\([^"]*\)"
Replace with: \1

Edit:

To break the expression down:

^ - start of line
[^"]* - in a set the caret ^ is for negation, 
        so a greedy match of anything that is not a "
        in this case, everything up to the first quote
" - the first quote per line in your source text
\(...\) - puts together a group that can be referenced later
[^"]* - same explanation as above, this time matching the url in question
" - the last quote on the line

Also, looking through the help doc on Regex in TextPad, there is a chart of legal expressions listing the 'Default' and 'POSIX' versions side by side. The only difference seems to be that the Default syntax escapes the grouping parens () and the occurrence curlies {}, while the POSIX version leaves them unescaped.

With that in mind, to get the job done in TextPad with the 'use POSIX regular expression syntax' option checked, swap out the above 'Find in' expression with the following:

Find in: ^[^"]*"([^"]*)"
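To check the pattern outside TextPad, here is a rough JavaScript sketch. It assumes JavaScript's regex syntax, which, like the POSIX option, leaves grouping parens unescaped, and uses $1 where TextPad uses \1. The variable names are illustrative only, and the lines are processed one at a time to mimic TextPad's line-based search.

```javascript
// Sample lines copied from the question.
const lines = [
  'bellpepper = "http://images.com/bellpepper.jpg"',
  'cabbage = "http://images.com/cabbage.jpg"',
  'lettuce = "http://images.com/lettuce.jpg"',
  'pumpkin = "http://images.com/pumpkin.jpg"',
];

// Same pattern as the POSIX-style "Find in" expression.
const pattern = /^[^"]*"([^"]*)"/;

// "$1" plays the role of TextPad's \1 replacement.
const urls = lines.map((line) => line.replace(pattern, "$1"));

console.log(urls.join("\n"));
```

Each entry of urls should hold only the text that was between the quotes on the corresponding line.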

Related Solutions

Javascript – How to access the matched groups in a JavaScript regular expression

Edit: 2019-09-10

As you can see, the way to iterate over multiple matches was not very intuitive. This led to the proposal of the String.prototype.matchAll method, which is expected to ship in the ECMAScript 2020 specification. It gives us a clean API and solves multiple problems. It has started to land in major browsers and JS engines: Chrome 73+, Node 12+, and Firefox 67+.

The method returns an iterator and is used as follows:

const string = "something format_abc";
const regexp = /(?:^|\s)format_(.*?)(?:\s|$)/g;
const matches = string.matchAll(regexp);

for (const match of matches) {
  console.log(match);       // full match array, including captured groups
  console.log(match.index); // position in the string where the match starts
}

Because it returns an iterator, the method is lazy: matches are produced only as you consume them. This is useful when handling particularly large numbers of capturing groups or very large strings. If you need to, you can easily turn the result into an Array with the spread syntax or the Array.from method:

function getFirstGroup(regexp, str) {
  const array = [...str.matchAll(regexp)];
  return array.map(m => m[1]);
}

// or:
function getFirstGroup(regexp, str) {
  return Array.from(str.matchAll(regexp), m => m[1]);
}
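The laziness described above can be seen by pulling matches one at a time. This is a minimal sketch; the input string and pattern are made up for illustration.

```javascript
// Hypothetical input, chosen only for illustration.
const text = "format_a format_b format_c";
const iterator = text.matchAll(/format_(\w+)/g);

// Ask for a single match; later matches are not computed until requested.
const first = iterator.next();
console.log(first.value[1]); // "a"
console.log(first.done);     // false

// The remaining matches are produced only when the iterator is advanced again.
const second = iterator.next();
console.log(second.value[1]); // "b"
```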

In the meantime, while this proposal gains wider support, you can use the official shim package.

Also, the internal workings of the method are simple. An equivalent implementation using a generator function would be as follows:

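function* matchAll(str, regexp) {
  // Work on a copy so the caller's lastIndex is never touched.
  const flags = regexp.global ? regexp.flags : regexp.flags + "g";
  const re = new RegExp(regexp, flags);
  let match;
  while ((match = re.exec(str)) !== null) {
    yield match;
    // An empty match leaves lastIndex unchanged, so step forward manually.
    if (match[0] === "") re.lastIndex++;
  }
}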

A copy of the original regexp is created to avoid side effects from the mutation of the lastIndex property while going through the multiple matches.

Also, we need to ensure the regexp has the global flag to avoid an infinite loop.
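The lastIndex side effect mentioned above can be shown with a small sketch. The names original and copy are illustrative and not part of any API.

```javascript
const original = /a/g;
const input = "aa";

// exec on a global regexp advances its lastIndex past the match.
original.exec(input);
console.log(original.lastIndex); // 1

// A copy made with the RegExp constructor starts again from lastIndex 0,
// so iterating over the copy leaves the original's state alone.
const copy = new RegExp(original, original.flags);
console.log(copy.lastIndex); // 0

copy.exec(input);
copy.exec(input);
console.log(original.lastIndex); // still 1
```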

I'm also happy to see that even this StackOverflow question was referenced in the discussions of the proposal.

Javascript – How to use a variable in a regular expression

Instead of using the /regex\d/g syntax, you can construct a new RegExp object:

var replace = "regex\\d";
var re = new RegExp(replace,"g");

You can dynamically create regex objects this way. Then you will do:

"mystring1".replace(re, "newstring");

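If the variable holds arbitrary text rather than a ready-made pattern, its special characters usually need escaping before it is passed to the constructor. The sketch below assumes that situation. escapeRegExp is a hypothetical helper, not a built-in used by the answer above.

```javascript
// Hypothetical helper: prefix every regex metacharacter with a backslash.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const userInput = "1.5"; // illustrative value; "." would otherwise match any character
const re = new RegExp(escapeRegExp(userInput), "g");

const result = "1.5 and 1x5".replace(re, "X");
console.log(result); // "X and 1x5"
```

Without the escaping step, the unescaped "." would also match the "x" in "1x5".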