Google-sheets – IMPORTXML Google Spreadsheet Amazon review numbers

amazongoogle sheetsimportxmlxml

I am making a spreadsheet which requires me to have the number of amazon reviews of a product. I looked up how to dynamically update the reviews as more came in and came across the IMPORTXML formula.

In this specific example I can not find out how to get the information on the sheet as I keep getting the error "imported xml content can not be parsed."

My current formula is =IMPORTXML("http://www. amazon.com/Optoma-GT1080-1080p-Gaming-Projector/product-reviews/B00M9D4CAK/","//*[@id='cm_cr-product_info']/div/div[1]/div[2]/span")

(Space inserted before "amazon" to avoid the autoconversion of the link).

This is the piece i want to get.

this is what i'm trying to get

Best Answer

Not every webpage can be scraped with importXML. There are several possible issues:

  1. Some content on the page may be generated with JavaScript, which is not executed by importXML.
  2. The page may be invalid as XML file. A typical HTML page is invalid XML (unless it is an XTHML page). The parser of importXML tries to sort things out, but in the presence of a lot of non-XML-compliant content (e.g., massive inlined scripts served by Amazon) it fails.
  3. The site may forbid bots from accessing certain content.

With Amazon you are facing the second and third obstacles. A way to get around the second issue is to use UrlFetchApp.fetch(url).getContentText() from Apps Script. This gets the entire source of the page, which you can then parse in any way you want, e.g., with regular expressions.

One has to be careful with scraping Amazon. When experimenting, I sometimes got the following message instead of the page content.

To discuss automated access to Amazon data please contact api-services-support@amazon.com.

But here is an Apps Script function that grabs the number of reviews of this product and inserts it into cell A1 of the current sheet.

function reviewCount() {   // remove space before amazon on the next line 
  var url = "http://www. amazon.com/Optoma-GT1080-1080p-Gaming-Projector/product-reviews/B00M9D4CAK/";
  var content = UrlFetchApp.fetch(url).getContentText();
  var match = /totalReviewCount">([0-9, ]+)/.exec(content);
  SpreadsheetApp.getActiveSheet().getRange("A1").setValue(match[1]);
}

For all I know, this may still be against Amazon's TOS. Related: Is there an Amazon.com API to retrieve product reviews?