Html – Phantom element using ImportXML with XPath in Google Spreadsheet

google sheetshtmlxpath

I am trying to get the value of an element attribute from this site via importXML in Google Spreadsheet using XPath.

The attribute value i seek is content found in the <span> with itemprop="price".

<div class="left" style="margin-top: 10px;">
    <meta itemprop="currency" content="RON">
        <span class="pret" itemprop="price" content="698,31 RON">
            <p class="pret">Pretul tau:</p>
            698,31 RON
        </span>
...
</div>

I can access <div class="left"> but i can't get to the <span> element.

Tried using:

  • //span[@class='pret']/@content i get #N/A;
  • //span[@itemprop='price']/@content i get #N/A;
  • //div[@class='left']/span[@class='pret' and @itemprop='price']/@content i get #N/A;
  • //div[@class='left']/span[1]/@content i get #N/A;
  • //div[@class='left']/span/text() to get the text node of <span> i get #N/A;
  • //div[@class='left']//span/text() i get the text node of a <span> lower in div.left.

To get the text node of <span> i have to use //div[@class='left']/text(). But i can't use that text node because the layout of the span changes if a product is on sale, so i need the attribute.

It's like the span i'm looking for does not exist, although it appears in the development view of Chrome and in the page source and all XPath work in the console using $x("").

I tried to generate the XPath directly form the development tool by right clicking and i get //*[@id='produs']/div[4]/div[4]/div[1]/span which does not work. I also tried to generate the XPath with Firefox and plugins for FF and Chrome to no avail. The XPath generated in these ways did not even work on sites i managed to scrape with "hand coded XPath".

Now, the strangest thing is that on this other site with apparently similar code structure the XPath //span[@itemprop='price']/@content works.

I struggled with this for 4 days now. I'm starting to think it's something to do with the auto-closing meta tag, but why doesn't this happen on the other site?

Best Answer

Perhaps the following formulas can help you:

=ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']/text()")

Or

=INDEX(ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']"), 1, 2)

UPDATE

It seems that not properly parse the entire document, it fails. A document extraction, something like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<div class="product-info-price">
    <div class="left" style="margin-top: 10px;">
        <meta itemprop="currency" content="RON">
        <span class="pret" itemprop="price" content="698,31 RON">
            <p class="pret">Pretul tau:</p>
            698,31 RON
        </span>
        <div class="resealed-info">
            <a href="/resigilate/componente-pc/placi-de-baza/" rel="nofollow">» Vezi 1 resigilat din aceasta categorie</a>
        </div>
        <ul style="margin-left: auto;margin-right: auto;width: 200px;text-align: center;margin-top: 20px;">
            <li style="color: #000000; font-size: 11px;">Rata de la <b>28,18 RON</b> prin <a href="http://www.marketonline.ro/rate-sapte-stele?amount=698.31#brdfinance" title="BRD Finance" target="_blank" class="rate" rel="nofollow">BRD</a></li>
            <li style="color: #5F5F5F;text-align: center;">Pretul include TVA</li>
            <li style="color: #5F5F5F;">Cod produs: <span style="margin-left: 0;text-align: center;font-weight: bold;" itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span> </li>
        </ul>
    </div>
    <div class="right" style="height: 103px;line-height: 103px;">
        <form action="/?a=shopping&amp;sa=addtocart" method="post" id="add_to_cart_form">
            <input type="hidden" name="product-183641" value="on"/>
            <a href="/adaugaincos-183641" rel="nofollow"><img src="/templates/marketonline/images/pag-prod/buton_cumpara.jpg"/></a>
        </form>
    </div>
</div>
</html>

works with the following XPath query:

"//div[@class='product-info-price']//div[@class='left']//span[@itemprop='price']/@content"

UPDATE

It occurs to me that one option is that you can use Apps Script to create your own ImportXML function, something like:

/* CODE FOR DEMONSTRATION PURPOSES */
function MyImportXML(url) {
  var found, html, content = '';
  var response = UrlFetchApp.fetch(url);
  if (response) {
    html = response.getContentText();
    if (html) content = html.match(/<span class="pret" itemprop="price" content="(.*)">/gi)[0].match(/content="(.*)"/i)[1];
  }
  return content;
}

Then you can use as follows:

=MyImportXML("http://...")
Related Topic