Google Sheets – Extracting GICS Codes from Fidelity Website

google sheetsimportdataimporthtmlimportxmlxpath

This is the website:

https://eresearch.fidelity.com/eresearch/goto/evaluate/snapshot.jhtml?symbols=aapl&type=o-NavBar

I'm trying to pull out the following two pieces of data into a Google Sheet:

  • Sector (GICSĀ®)
  • Industry (GICSĀ®)

That is, I want the following two to show up for the above:

  • Information Technology
  • Technology Hardware, Storage & Peripherals

I've tried the usual techniques including:

  • importData
  • importHTML
  • importXML (this gave an error).

The XPath's I've derived via Google's Inspect Element tool are:

  • //*[@id="companyProfile"]/div[8]/span/a
  • //*[@id="companyProfile"]/div[13]/span/a

Nothing has worked so far. How can I extract this data into a Google Sheet?

Best Answer

The command IMPORTXML ignores the nodes with no text content: for example, taking //div[3] from the document where the body consists of

<div>First</div> <div>Second</div> <div></div> <div>Fourth</div>

results in "Fourth". So, when you count the <div> elements, skip over those where there is no text. The elements you are looking for are returned with

=IMPORTXML( url , "//div[@id='companyProfile']/div[4]/span")

and

=IMPORTXML( url , "//div[@id='companyProfile']/div[5]/span")

For a more robust solution, I advise not relying on the numbering of elements at all. The following command returns both values you want with a single call, one under another:

=IMPORTXML( url , "//div[@id='companyProfile']/div[@class='sub-heading']/span")

You can apply TRANSPOSE to the result to put them side by side, if preferable. Or, if some custom positioning is needed, get one at a time with

=IMPORTXML( url , "//div[@id='companyProfile']/div[@class='sub-heading'][1]/span")

and

=IMPORTXML( url , "//div[@id='companyProfile']/div[@class='sub-heading'][2]/span")

Note that one should use single quotes in an XPath command, since it's a string surrounded by double quotes.


Using Inspect Element tool is not an inherently bad idea (it shows a nice tree view of the document), but there is an important caveat: this tool shows the document after any JavaScript runs on the page, while IMPORTXML gets the source as it is before any JavaScript processing. This matters when some elements get added by a script (example in my answer here). To see exactly what IMPORTXML works with, use right-click -> "View Page Source" in Chrome, or its equivalent in other browsers.