XPath – Get Non-Distinct Parent Node Data for Each Child Node

google sheetsimportxmlxpath

Google Spreadsheets has an IMPORTXML function, which I am trying to use to maintain a list of TV episodes, based on the API at http://services.tvrage.com/feeds/episode_list.php?sid=3183.

I am trying to get a table of Date | Season No | Episode No | Title from that XML.

The XML is hierarchical, nesting episodes under seasons.

I can get my date, episode and title columns, using xpath:

  • Date – /Show/Episodelist/Season/episode/airdate
  • Episode No – /Show/Episodelist/Season/episode/seasonnum
  • Title – /Show/Episodelist/Season/episode/title

However, I cannot get seem to find the right xpath to extract the Season number for each episode entry, which is the no attribute to the Season parent node.

I have tried:

  • /Show/Episodelist/Season/episode/../@no
  • /Show/Episodelist/Season/episode/parent::Season/@no

Both of these result in getting a distinct list of seasons, i.e.:

Date         Season  Episode  Title
01-01-2001   1       1        foo
02-01-2001   2       2        bar
03-01-2001   3       3        baz
04-01-2001           4        fee
05-01-2001           5        fob
01-03-2002           1        bix
02-03-2002           2        buz
03-03-2002           3        fez
04-03-2002           4        baj

…whereas I am looking for:

Date         Season  Episode  Title
01-01-2001   1       1        foo
02-01-2001   1       2        bar
03-01-2001   1       3        baz
04-01-2001   1       4        fee
05-01-2001   1       5        fob
01-03-2002   2       1        bix
02-03-2002   2       2        buz
03-03-2002   2       3        fez
04-03-2002   2       4        baj

What is the correct xpath so that the IMPORTXML routine does not do a distinct operation on the imported data?

Best Answer

As stated in my comment, doing this with =importXML and XPath seems tricky.

But doing it programmatically, using the XmlService of Google Apps Script, looks easier.

I have written the following:

function parseTvRageXml(url) {
   var result = [];
   var xml = UrlFetchApp.fetch(url).getContentText();
   var document = XmlService.parse(xml);
   var root = document.getRootElement();

   var show = document.getRootElement();
   var episodeList = show.getChild("Episodelist");
   var seasons = episodeList.getChildren("Season");
   for (var i = 0; i < seasons.length; i++) {
     var season = seasons[i];
     var seasonNum = season.getAttribute("no").getValue();
     var episodes = season.getChildren("episode");
     for (var j = 0; j < episodes.length; j++) {
       var episode = episodes[j];
       var resultRow = [];
       resultRow.push(seasonNum);
       resultRow.push(episode.getChild("epnum").getText());
       resultRow.push(episode.getChild("seasonnum").getText());
       resultRow.push(episode.getChild("airdate").getText());
       resultRow.push(episode.getChild("link").getText());
       resultRow.push(episode.getChild("title").getText());
       result.push(resultRow);                
     }
   }
  return result;
 }

In effect, it fetches the content from the given URL, parses it as XML, and iterates through the elements, extracting values of the elements you wanted, which is stored in a 2-dimensional array (result), and returned.

To use this, you must install the script in your spreadsheet. Click the Tools menu → Script editor, and paste the code listed above. Save the script.

In the cell where you want the output, enter

=parseTvRageXml("http://services.tvrage.com/feeds/episode_list.php?sid=3183")

The result should be a list of episodes, with the following columns:

Season number | epnum | seasonnum | airdate | link | title

I have set up an example spreadsheet to demonstrate this, feel free to copy it.