Google Sheets – How to Extract Multiple Paths from URLs

google sheetsgoogle-sheets-arrayformularegexregexextracturl

How can I get a subrange of second to last element in a formula under Google Sheets?

I'm trying to extract all occurrences of https://example.com/* in a cell which may contain multiple URLs scattered across a string like:

abc https://example.com/abc https://google.com/ghi https://example.com/def/abc

so that I get the paths in a single string, separated by spaces as follows:

abc def/abc

As SPLIT looks the easiest way for that, I need to get rid of the first item in a formula like:

=join("
",
  arrayformula(
    regexextract(
      split(A5, "https://example.com/", false, true),
      "[a-zA-Z0-9/\?=_]+")
    )
  )

Best Answer

You have the basic concept right just piece it together step by step. This does the job:

=REGEXREPLACE(JOIN(" ",ARRAYFORMULA(IFERROR(
  REGEXEXTRACT(SPLIT(A1," ",TRUE,TRUE),
  "https://example.com/(\S+)"),""))),"\s+"," ")
  1. You split the string at whitespaces, this is safe because URLs don't contain white space so you're definitely not breaking them up. This gives you an array of strings
  2. Since you want all the paths for all the domains starting with example.com you REGEXEXTRACT exactly those paths and only those paths using a capturing group (\S+).
  3. Now you will get an error for every string that doesn't have the correct domain. So you use IFERROR to simply replace those strings with empty strings, we don't care about those.
  4. We wrap all of this in ARRAYFORMULA since the output of split is an array. So now we have an array of empty strings and paths.
  5. We now JOIN them together using white space, again, paths don't contain white space so we're not losing any structure.
  6. Lastly, we're cleaning up any white space tuples since they're unnecessary and not pretty using REGEXREPLACE. Done!

The result is a string of all the paths for the correct domain only, separated by one whitespace each.

Tip: I built this formula step by step, not as one big formula but piecewise in exactly the sequence outlined above to keep things manageable.

enter image description here