Google Sheets – How to Categorize Data Using Regular Expressions

google sheetsregexextract

I'm trying to categorize some data in a Google spreadsheet.

  • I have "phrases to screen" in column A, TAB "Phrases". For example "blueish house"
  • I have "words" in column A, TAB "Words". For example, "blue"
  • I have "categories" in column B, TAB "Words". For example, "color"

I created a sample file here:
There's a link to a related post too that can help.

I'd like to use regular expressions using the words in column A to match some phrases in A, and then return the appropriate category from B.

For example, I'd like to find "blue" in "blueish house" and return "color".
I'd like to find "white" in "great white shoes" and return "color", but not when followed by numbers. Hence, I need to use regular expressions.

I'm using the following formula in column C TAB "phrases". It works, but not with regular expressions. It returns the reg. expressions itself and then can't match the category.

=arrayformula(vlookup(arrayformula(iferror(regexextract(A2:A8,join("|",Words!$A$1:$A$7)))),Words!A$1:B,2,0))

I tried to adapt the formula (from the related post) in column D but it's not working.

Best Answer

SO here is what I did - I added a sheet to your doc called SO Test - Aurielle.

Then I made a unique list of the possible categories in column B using:

=UNIQUE(Words!B:B)

In Column A, I did a JOIN using the regex AND operator which is | and used the formula:

=IF(ISTEXT(B2),JOIN("|",FILTER(Words!A:A,Words!B:B=B2)),)

Basically the filter restricts it to combine keywords by their category value.

Then in column D I added this formula:

=IFERROR(INDEX(A$2:B,MATCH(TRUE,ARRAYFORMULA(REGEXMATCH(C2,INDIRECT("A"&2&":A"&3+COUNTA(B$2:B)))),0),2))

Basically what happens here , is if you break it down from the inside out is I'm using ARRAYFORMULA along with REGEXMATCH - which returns true or false depending on which row the value actually exists in.

So I use the word TRUE to be my key for the MATCH formula, then using INDEX, I navigate it to pull in the index row, and one column over thus grabbing the category.

NOTE: I also added an additional INDIRECT formula in there to calculate how many values actually exist in column B so that your formula will dynamically accommodate on the number of rows it needs to..

enter image description here