Question

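This is my first question here, and it would be great to find an answer. I'm new to Nokogiri.

Here is my problem. The HTML head of the target site (in this case a TechCrunch post) contains something like this: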

<meta content="During my time at TechCrunch I've seen thousands of startups and written about hundreds of them. I sure as hell don't know all ..." name="description"/>

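I now want a script that walks through the meta tags, finds the one whose name attribute is "description", and gets the value of its content attribute.

I tried something like this: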

require 'rubygems'
require 'nokogiri'
require 'open-uri'

url = "http://www.techcrunch.com/2009/10/11/the-underutilized-power-of-the-video-demo-to-explain-what-the-hell-you-actually-do/"
doc = Nokogiri::HTML(open(url))
posts = doc.xpath("//meta")
posts.each do |link|
  a = link.attributes['name']
  b = link.attributes['content']
end

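After that I could pick out the link whose name attribute equals "description", but this code returns nil for both a and b.

I played around with posts = doc.xpath("//meta"), posts = doc.xpath("//meta/*") and so on, but still got nothing.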

Answer 1

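The problem is not the XPath; it seems the document is not being parsed completely. You can check this with puts doc: the output does not contain the full input. The parser appears to trip over the comments (I suspect either invalid HTML or a bug in libxml2).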

In your case, I would use a regular expression as a workaround. Since the <meta> tags are simple, that would probably work, e.g. /<meta name="([^"]*)" content="([^"]*)"/
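
A rough sketch of the regex workaround, using only the Ruby standard library. Note that the tag quoted in the question has content before name, so this pattern assumes that attribute order; the html string and the description_re name are made up for illustration:

```ruby
# Hypothetical stand-in for the raw page source.
html = '<head><meta content="A short summary." name="description"/></head>'

# Matches a <meta> tag with content="..." followed by name="description",
# which is the attribute order of the tag shown in the question.
description_re = /<meta\s+content="([^"]*)"\s+name="description"\s*\/?>/i

match = html.match(description_re)

# match is nil when the tag is absent or its attributes are ordered differently.
description = match && match[1]
puts description
```

This is fragile by design: extra attributes, single quotes, or a different attribute order will make it miss the tag, so it is only a stopgap while the parser problem is unresolved.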

Answer 2

You should change

doc = Nokogiri::HTML(open(url))

to

doc = Nokogiri::HTML(open(url).read)

Update: or maybe not :) Your code actually works for me, using ruby 1.8.7 / nokogiri 1.4.0

How do I use Nokogiri in Ruby to extract the content attribute of a website's meta tag that has a given name attribute value?
