Google Sheets – How to Determine the Xpath Query for IMPORTXML

google-sheets importxml xpath

I'm looking to retrieve the most recent financial figure for a single data point from a table using IMPORTXML.

The target page displays three sets of financial figures for a company on quarterly and annual bases. In this case, I only want the "Total Debt" figure listed from the balance sheet for whatever is the most recent fiscal period available.

I've used the following IMPORTXML formula:

=IMPORTXML("https://www.google.com/finance?q=GOOG&fstype=ii","//td[contains(.,'Total Debt')]")

This returns only the data label, displayed in two vertically adjacent cells:

Total Debt
Total Debt

I've also used this IMPORTXML formula, with the XPath query copied from Chrome:

=IMPORTXML("https://www.google.com/finance?q=GOOG&fstype=ii","//*[@id='fs-table']/tbody/tr[27]/td[2]")

This returns four values (data from two financial statements for two periods each), displayed vertically in adjacent cells as below; the third value down is the correct one. Per this answer to another discussion, I've also tried removing the "tbody" element node, which results in an "#N/A" error: "Imported content is empty."

4,207.00
15,826.00
5,208.00
5,220.00

For now, I'm trying to avoid using IMPORTHTML and INDEX for parsing a whole table since I need only a single value from the table.

How do I determine the XPath query for this page to retrieve (1) the "Total Debt" figure for (2) always the most recent reporting period?

EDIT: Because there are two elements with the same name, "Total Debt," I also tried the formula below, with and without predicates (appending 1, 2, [last], etc. in square brackets), and got an error saying the imported content is empty.

=IMPORTXML("https://www.google.com/finance?q=GOOG&fstype=ii","//*[local-name() = 'Total Debt'][1]")

Best Answer

Short answer

AFAIK, there is no straightforward method for determining the XPath query to use with IMPORTXML: its XPath 1.0 support appears to be incomplete, and web page developers are free to structure their pages however they choose.

Explanation

Tools like Chrome Developer Tools or browser extensions/add-ons can be helpful, but sometimes they don't return an XPath query that IMPORTXML can use, because each tool's developers implemented XPath support differently. On the other hand, web pages may or may not comply with XML rules. So finding the XPath query to use with IMPORTXML may require analyzing the structure of the source web page and making several attempts.

XPath queries for the use case

Both of the XPath queries below return 5,208.00:

//div[@id="balinterimdiv"]//tr[contains(.,'Total Debt')]/td[2]

(//tr[contains(.,'Total Debt')]/td[2])[1]

Explanation

The referred page includes two views of the Balance Sheet: Quarterly Data and Annual Data. Both appear to have the same structure, as each includes a table cell (td tag) with the text Total Debt. Fortunately, each view is inside a div tag with its own id. So, to get only one value, the first step in the XPath query selects the right view, the second step selects the right table row (tr tag), and the third step selects the right table cell (td tag).
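The three steps can be tried offline on a small stand-in document. The sketch below uses Python's xml.etree.ElementTree rather than IMPORTXML. The markup, the balannualdiv id, and the cell layout are assumptions made for illustration; only balinterimdiv and the two figures come from this thread. ElementTree supports only a subset of XPath and has no contains(), so tr[td='Total Debt'] stands in for tr[contains(.,'Total Debt')].

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page: two views, each inside its own div.
# "balannualdiv" and the "(older period)" cells are assumptions.
PAGE = """
<html><body>
  <div id="balinterimdiv">
    <table>
      <tr><td>Total Debt</td><td>5,208.00</td><td>(older period)</td></tr>
    </table>
  </div>
  <div id="balannualdiv">
    <table>
      <tr><td>Total Debt</td><td>5,220.00</td><td>(older period)</td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Step 1: select the view (div with a known id).
# Step 2: select the row that has a "Total Debt" cell.
# Step 3: select the second cell of that row (the most recent period).
path = ".//div[@id='balinterimdiv']//tr[td='Total Debt']/td[2]"
cells = root.findall(path)

latest_total_debt = cells[0].text
print(latest_total_debt)
```

Because the div id narrows the search to one view, the path matches a single cell even though the label appears in both views.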

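Another approach is to use the construct (xpath_query)[position() = 1], which keeps only the first node of the whole result (see the reference). The second query above uses the equivalent shorthand [1].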
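To see why the outer parentheses matter, here is a minimal Python sketch on a hypothetical two-view document (the markup and the balannualdiv id are assumptions). ElementTree cannot parse the parenthesized form, so taking the first item of the result list emulates (xpath_query)[position() = 1]. It is an emulation of the idea, not what IMPORTXML runs internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in: the same row label appears once per view.
PAGE = """
<html><body>
  <div id="balinterimdiv"><table>
    <tr><td>Total Debt</td><td>5,208.00</td></tr>
  </table></div>
  <div id="balannualdiv"><table>
    <tr><td>Total Debt</td><td>5,220.00</td></tr>
  </table></div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Without the outer (...)[1], the path matches one cell per view.
all_matches = [td.text for td in root.findall(".//tr[td='Total Debt']/td[2]")]

# (path)[position() = 1] keeps only the first node of the whole result,
# in document order; list indexing emulates that here.
first_match = all_matches[0]

print(all_matches)
print(first_match)
```

This only works as intended when the view you want comes first in the page source, which is the case for the quarterly view in this sketch.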

References

Answer by Dimitre Novatchev to "What is the XPath expression to find only the first occurrence?", referred to by Dale in a comment on another answer to the question in this thread.

Related Solutions

Google-sheets – In Google Spreadsheets the xPath method with Google Maps don’t work anymore

Getting data by parsing the HTML source of a page is a fragile method: when the page gets redesigned, the import is likely to break because the data you want is no longer where you expect it. It's no coincidence that most of the examples on Google's own IMPORTXML documentation page are broken by now.

I recommend using the Google Distance Matrix API instead (it is a subset of the Google Maps API). It also accepts HTTP requests and can return data as XML, but this XML will have a predictable, documented structure for the lifetime of the API version.

Examples of parsing Google's XML response with XPath are found here.

Google Sheets – ImportXML with XPath Not Returning Table Entry

It's because of <tbody>. There is no <tbody> in the HTML source. The browser inserts it on its own (this has nothing to do with JavaScript on the page) because it believes it should be there, according to the HTML spec.

Just remove /tbody from the path:

=IMPORTXML( "https://eresearch.fidelity.com/eresearch/evaluate/fundamentals/keyStatistics.jhtml?stockspage=keyStatistics&symbols=aapl" , "//div[@id='audit-integrity'][1]/table/tr[10]/td[2]")

And in general: when XPath doesn't work, debug by removing all of the selectors, then adding one by one until the formula breaks...
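The add-one-selector-at-a-time approach can be sketched offline in Python. The markup below is a hypothetical, well-formed stand-in for the raw source (no tbody), and first_failing_step is a helper name invented for this example. IMPORTXML itself works on the live page, so this only illustrates the debugging idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw source: note there is no <tbody> element.
SOURCE = """
<html><body>
  <div id="audit-integrity">
    <table>
      <tr><td>label</td><td>value</td></tr>
    </table>
  </div>
</body></html>
"""


def first_failing_step(root, steps):
    """Add path steps one at a time; return the first step that matches nothing."""
    path = "."
    for step in steps:
        path += "/" + step
        if not root.findall(path):
            return step
    return None


root = ET.fromstring(SOURCE.strip())

# A path as a browser's "Copy XPath" might produce it, including the
# tbody step that exists only in the browser's DOM.
copied = ["body", "div[@id='audit-integrity']", "table", "tbody", "tr", "td[2]"]
fixed = [step for step in copied if step != "tbody"]

print(first_failing_step(root, copied))  # tbody
print(first_failing_step(root, fixed))   # None
```

In a sheet, the same check amounts to shortening the XPath in the IMPORTXML formula and extending it again one step at a time until the result turns into an error.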
