Google Sheets – How to Use IMPORTDATA, IMPORTFEED, IMPORTHTML, and IMPORTXML Functions

google-sheets importdata importfeed importhtml importxml

Google Sheets has several import functions:

IMPORTDATA
IMPORTFEED
IMPORTHTML
IMPORTXML

Sometimes these functions return errors such as #N/A Imported content is empty, and I would like to be sure that the problem isn't with the content of the resource being imported.

How do I know if these functions are able to get the content that I am looking to import?

I know that IMPORTRANGE also exists, but that function can only import content from Google Sheets spreadsheets.

Best Answer

TL;DR

If the content is added dynamically (by using JavaScript), it can't be imported by using Google Sheets built-in functions. Also, if the website's webmaster has taken certain measures, these functions will not be able to import the data.

Content added dynamically

To check if the content is added dynamically, using Chrome:

Open the URL of the source data.
Press F12 to open Chrome Developer Tools.
Press Control+Shift+P (Command+Shift+P on Mac) to open the Command Menu.
Start typing javascript, select Disable JavaScript, and then press Enter to run the command. JavaScript is now disabled.

JavaScript will remain disabled in this tab so long as you have DevTools open.

Reload the page to see whether the content that you want to import is shown. If it is, it can be imported by using Google Sheets built-in functions. Otherwise it can't, although it might be possible to get it by other means of web scraping.

According to Wikipedia, web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.
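
The browser check above can be approximated in a script: download the page without executing any JavaScript and look for a piece of text that you expect to find in the data. This is a minimal Python sketch using only the standard library. The names fetch_static_html and appears_in_static_source and the sample pages are hypothetical, and Google's fetcher will not necessarily behave exactly like urllib, so treat a positive result as a hint rather than a guarantee.

```python
from urllib.request import Request, urlopen


def fetch_static_html(url: str, timeout: float = 10.0) -> str:
    """Download the raw response body; no JavaScript is executed."""
    # Assumption: the resource is public and text-based.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def appears_in_static_source(html: str, expected_text: str) -> bool:
    """Return True if the expected text is already in the raw source."""
    return expected_text.casefold() in html.casefold()


# Offline illustration with two hand-written pages (hypothetical content).
static_page = "<table><tr><td>Price</td><td>42</td></tr></table>"
dynamic_page = "<div id='app'></div><script src='bundle.js'></script>"

print(appears_in_static_source(static_page, "Price"))   # True
print(appears_in_static_source(dynamic_page, "Price"))  # False
```

For a real page, pass the result of fetch_static_html(url) as the first argument. If the text is missing there but visible in the browser, the content is most likely added by JavaScript.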

Use of robots.txt to block web crawlers

Webmasters can use a robots.txt file to block access to a website. In that case the result will be #N/A Could not fetch url.
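
To see whether a site's robots.txt rules would block a given URL, the rules can be evaluated locally. This is a minimal Python sketch using urllib.robotparser from the standard library. The helper is_allowed, the agent name "ExampleBot" and the sample rules are hypothetical; the user agent that Google Sheets actually sends is not stated here, so the "*" rules are the safest ones to inspect.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate robots.txt rules for one user agent and one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks every crawler from /data/.
sample_rules = """\
User-agent: *
Disallow: /data/
"""

print(is_allowed(sample_rules, "ExampleBot", "https://example.com/data/prices.csv"))  # False
print(is_allowed(sample_rules, "ExampleBot", "https://example.com/index.html"))       # True
```

The text of a real robots.txt file can be read from the /robots.txt path of the site and passed in as the first argument.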

Use of the user agent

The web page could be designed to return a custom message instead of the data when the request comes from a particular user agent.
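
One way to probe for this kind of behavior is to request the same URL with two different User-Agent headers and compare what comes back. This is a rough Python sketch using only the standard library. The function names and the agent strings in the usage note are hypothetical, and the user agent that Google Sheets sends is an unknown here, so the sketch can only show that a site treats agents differently, not what Sheets itself receives.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_request(url: str, user_agent: str) -> Request:
    """Create a request that announces the given user agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_status_and_body(url: str, user_agent: str, timeout: float = 10.0):
    """Return (HTTP status, body bytes), including for error responses."""
    try:
        with urlopen(build_request(url, user_agent), timeout=timeout) as response:
            return response.status, response.read()
    except HTTPError as error:
        # A 403 or similar status for one agent only is itself a strong hint.
        return error.code, error.read()


def responses_differ(url: str, agent_a: str, agent_b: str) -> bool:
    """Compare the responses served to two user agents."""
    # Caveat: pages with timestamps or ads can differ between any two requests.
    return fetch_status_and_body(url, agent_a) != fetch_status_and_body(url, agent_b)
```

For example, responses_differ(url, "Mozilla/5.0", "ExampleBot") returning True for a page whose content is otherwise stable suggests that the site varies its output by user agent.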

IMPORTDATA, IMPORTFEED, IMPORTHTML and IMPORTXML are able to get content from resources hosted on websites when the following conditions are met:

The resource is publicly available. This means that the resource doesn't require authorization or being logged in to any service to access it.
The content is "static". This means that if you open the resource using the view source option of a modern web browser, the content is displayed as plain text.
- NOTE: Chrome's Inspect tool shows the parsed DOM; in other words, the actual structure/content of the web page, which could have been dynamically modified by JavaScript code or browser extensions/plugins.
The content has the appropriate structure.
- IMPORTDATA works with structured content such as CSV or TSV, regardless of the file extension of the resource.
- IMPORTFEED works with marked-up content such as Atom/RSS.
- IMPORTHTML works with content marked up as HTML that includes properly marked-up lists or tables.
- IMPORTXML works with content marked up as XML or any of its variants, like XHTML.
Google servers are not blocked by means of robots.txt or the user agent.
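
For the IMPORTHTML condition in particular, it can help to count how many tables and lists the static source actually contains. This is a minimal Python sketch using html.parser from the standard library. The class and function names are hypothetical, and treating only ul and ol as lists is an assumption; the way Google Sheets itself numbers nested tables or lists may differ.

```python
from html.parser import HTMLParser


class TableAndListCounter(HTMLParser):
    """Count <table>, <ul> and <ol> start tags in raw HTML."""

    def __init__(self):
        super().__init__()
        self.tables = 0
        self.lists = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "table":
            self.tables += 1
        elif tag in ("ul", "ol"):
            self.lists += 1


def count_tables_and_lists(html: str):
    """Return (number of tables, number of lists) found in the markup."""
    counter = TableAndListCounter()
    counter.feed(html)
    counter.close()
    return counter.tables, counter.lists


# Hypothetical static source with one list and two tables.
sample = "<ul><li>a</li></ul><table><tr><td>1</td></tr></table><table></table>"
print(count_tables_and_lists(sample))  # (2, 1)
```

A result of (0, 0) on the static source means IMPORTHTML has nothing to pick up, whatever the browser shows after JavaScript has run.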

On the W3C Markup Validator site there are several tools to check whether the resources have been properly marked up.

Regarding CSV, check out "Are there known services to validate CSV files".
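
A quick local check can complement those validators. This is a minimal Python sketch using only the standard library: it reports the distinct column counts found in CSV text and tests whether XML text is well-formed. The function names and sample strings are hypothetical, and these checks are much weaker than a full validator; ragged rows or a parse error are only hints that the source may be the problem.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_column_counts(text: str, delimiter: str = ",") -> list:
    """Return the sorted distinct column counts across non-empty rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return sorted({len(row) for row in reader if row})


def is_well_formed_xml(text: str) -> bool:
    """Return True if the text parses as XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


good_csv = "name,price\napple,1.20\npear,0.80\n"
ragged_csv = "name,price\napple,1.20,extra\n"

print(csv_column_counts(good_csv))    # [2]
print(csv_column_counts(ragged_csv))  # [2, 3]
print(is_well_formed_xml("<feed><entry>ok</entry></feed>"))  # True
print(is_well_formed_xml("<feed><entry>broken</feed>"))      # False
```

For TSV content, pass delimiter="\t" to csv_column_counts.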

It's worth noting that the spreadsheet:

should have enough room for the imported content; Google Sheets has a limit of 5 million cells per spreadsheet and, according to this post, a limit of 18278 columns and of 50 thousand characters per cell, whether as a value or a formula.
doesn't handle large in-cell content well; the "limit" depends on the user's screen size and resolution, as it's now possible to zoom in/out.
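
The room check is simple arithmetic and can be sketched as follows. This is an illustrative Python sketch; the function name exceeded_limits and its parameters are hypothetical, and the constants are the figures quoted above, which may not match the limits currently enforced by Google Sheets.

```python
# Limits as quoted in the text above; current values may differ.
MAX_CELLS = 5_000_000      # cells per spreadsheet
MAX_COLUMNS = 18_278       # columns
MAX_CELL_CHARS = 50_000    # characters per cell


def exceeded_limits(rows, columns, cells_already_used=0, longest_cell=0):
    """Return the names of the limits an import would exceed (empty if none)."""
    problems = []
    if cells_already_used + rows * columns > MAX_CELLS:
        problems.append("cells")
    if columns > MAX_COLUMNS:
        problems.append("columns")
    if longest_cell > MAX_CELL_CHARS:
        problems.append("cell length")
    return problems


print(exceeded_limits(rows=100_000, columns=20))  # []  (2,000,000 cells)
print(exceeded_limits(rows=300_000, columns=20))  # ['cells']  (6,000,000 cells)
```

The cells_already_used argument matters because the cell limit applies to the whole spreadsheet, not only to the sheet that holds the import formula.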

References

The following question is about a different result, #N/A Could not fetch url:

Inability to use `IMPORTHTML` in Google sheets
