Google-sheets – In a Google spreadsheet, how to list unique items, modified by a regexp

formulasgoogle sheetsregexextract

I have a Google spreadsheet with items and prices. Column D has the prices, column C has item names, some of which contain multiple words. When an item has multiple words, only the first one is really significant.

What I'd like to achieve is this:

  1. Process the item list to strip off anything after the non-alpha character,
  2. SORT(UNIQUE()) the results,
  3. SUM() all the items that were matched to create this new list.

I will then generate a pie chart (this is food, after all) with the results. So for example, I might have:

  1    C             | D
  -------------------+------------
  2    Cheese        | $9.08
  3    Milk, 2L      | $3.89
  4    Avocado       | $0.99
  5    Avocado 3/$5  | $5.00
  6    Apples Fuji   | $2.75
  7    Avocado       | $1.49
  8    Milk 4L       | $4.75
  9    Apples        | $2.25

The end result will be:

  Cheese   | $9.08
  Milk     | $8.64
  Avocado  | $7.48
  Apples   | $5.00

I know that I can use REGEXEXTRACT() to parse an individual field:

=REGEXEXTRACT(C3; "^([A-Za-z]+)")

returns Milk, as I would expect. As well, I know that I can add up totals for a column using FILTER():

  L                   | M
  --------------------+----------
  =UNIQUE( C2:C )     | =SUM( FILTER( D$2:D; L2=C$2:C ) )
  =CONTINUE(C2, 2, 1) | =SUM( FILTER( D$2:D; L3=C$2:C ) )
  ...

But I want to combine these, and add up the totals of the items as modified by REGEXEXTRACT(). I was hoping I'd be able to do something like:

=UNIQUE( REGEXEXTRACT( C2:C; "^([A-Za-z]+)" ) )

and

=SUM( FILTER( D$2:D; L2=REGEXEXTRACT(C$2:C, "^([A-Za-z]+)") ) )

But when presented with a range, the REGEXEXTRACT() function delivers just a string, not an array. This is confirmed, because when I try the formula:

=REGEXEXTRACT(UNIQUE(C2:C); "^([A-Za-z]+)")

the error I get is "Not a string: Cheese".

So my question is … How do I make this work?

Do I need to make a "scratch" worksheet as a bridge? Is it possible to REGEXEXTRACT() each element of a UNIQUE() result set within a formula?

Best Answer

My approach would be to "normalise" the data first:

=ArrayFormula(IF({1,0};IFERROR(REGEXEXTRACT(A2:A;"^([a-zA-Z]+)"));B2:B))

and then you can actually wrap this directly in a QUERY function to produce the desired output:

=ArrayFormula(QUERY(IF({1,0};IFERROR(REGEXEXTRACT(A2:A;"^([a-zA-Z]+)"));B2:B);"select Col1, sum(Col2) where Col1 != '' group by Col1 label sum(Col2) ''";0))

You can even output a header row as well:

=ArrayFormula(QUERY(IF({1,0};IFERROR(REGEXEXTRACT(A2:A;"^([a-zA-Z]+)"));B2:B);"select Col1, sum(Col2) where Col1 != '' group by Col1 label Col1 'Food Item', sum(Col2) 'Total'";0))