Extracting the content attribute of a meta tag with a given name attribute, using Nokogiri in Ruby

nokogiri ruby xpath

This is my first question here, so it would be awesome to find an answer. I am new to Nokogiri.

Here is my problem. I have something like this in the HTML head of a target site (here a TechCrunch post):

<meta content="During my time at TechCrunch I've seen thousands of startups and written about hundreds of them. I sure as hell don't know all ..." name="description"/>

I would now like to have a script to run through the meta tags, locate the one with the name attribute "description" and get what is in the content attribute.

I have tried something like this:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.techcrunch.com/2009/10/11/the-underutilized-power-of-the-video-demo-to-explain-what-the-hell-you-actually-do/"
# On Ruby 3.0+ a bare open(url) no longer fetches URLs; open-uri provides URI.open.
doc = Nokogiri::HTML(URI.open(url))
posts = doc.xpath("//meta")
posts.each do |link|
  a = link.attributes['name']
  b = link.attributes['content']
end

After this I could select the tag whose name attribute equals "description", but this code returns nil for a and b.

I also played around with variations such as posts = doc.xpath("//meta") and posts = doc.xpath("//meta/*"), but still got nil.

Best Answer

The problem is not with the XPath; it seems the document is not parsed completely. You can check that with puts doc: the output does not contain the full input. It seems to be a problem with parsing comments (I suspect either invalid HTML or a bug in libxml2).

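In your case I would use a regular expression as a workaround. <meta> tags are simple enough that this might work, e.g. /<meta name="([^"]*)" content="([^"]*)"/. Note that such a pattern depends on attribute order: in the tag you quoted, content comes before name, so for that page the two attributes would have to be swapped, as in /<meta content="([^"]*)" name="([^"]*)"/.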
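A minimal sketch of the regular-expression workaround that does not depend on attribute order is shown below. It first isolates each meta tag and then reads its attributes into a hash. The helper name meta_content_by_regex is hypothetical, and the sketch assumes double-quoted attribute values, lowercase attribute names, and no ">" character inside a value.

```ruby
# Hypothetical helper: content attribute of the first <meta> tag whose name
# attribute equals `name`, or nil if no such tag is found.
def meta_content_by_regex(html, name)
  # Isolate every <meta ...> tag in the raw HTML.
  html.scan(/<meta\b[^>]*>/i).each do |tag|
    # Collect attribute="value" pairs into a hash, in whatever order they appear.
    attrs = tag.scan(/([\w-]+)\s*=\s*"([^"]*)"/).to_h
    return attrs['content'] if attrs['name'] == name
  end
  nil
end

sample = %q(<meta content="During my time at TechCrunch I've seen thousands of startups ..." name="description"/>)
puts meta_content_by_regex(sample, 'description')
```

Because this works on the raw string, it can be applied to the downloaded page body directly, without going through the parser.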
