Question

What is the fastest way to do this?

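I have HTML files that may (or may not) contain the word "Instructions", followed by several lines of instructions. I want to parse the pages that contain the word "Instructions" and extract the lines that follow it.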

Answer 1

Maybe something along these lines:

require 'rubygems'
require 'nokogiri'

def find_instructions(doc)
  doc.xpath('//body//text()').each do |text|
    # String is not Enumerable in Ruby 1.9+, so split into lines first
    instructions = text.content.lines.select do |line|
      # flip-flop matches each section starting with
      # "Instructions" and ending with an empty line
      true if (line =~ /Instructions/)..(line =~ /^$/)
    end
    return instructions unless instructions.empty?
  end
  []
end

puts find_instructions(Nokogiri::HTML(DATA.read))


__END__
<html>
<head>
  <title>Instructions</title>
</head>
<body>
lorem
ipsum
<p>
lorem
ipsum
<p>
lorem
ipsum
<p>
Instructions
- Browse stackoverflow
- Answer questions
- ???
- Profit

More
<p>
lorem
ipsum
</body>
</html>

Answer 2

This isn't the most "correct" way, but it will mostly work. Use a regular expression to find the string: ruby regex

The regex you want is /instructions([^<]+)/. This assumes the instructions are terminated by a < character.
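A minimal sketch of this regex approach in plain Ruby, using only the standard library. The sample `html` string and the `i` (case-insensitive) flag are assumptions added for illustration; they are not part of the answer above.

```ruby
# Hypothetical sample input; a real page would be read from a file instead.
html = "<body><p>lorem ipsum<p>Instructions\n- Browse stackoverflow\n- Answer questions\n<p>More</body>"

# Capture everything after the word "Instructions" up to the next "<".
# The /i flag (an assumption) lets "Instructions" and "instructions" both match.
match = html.match(/instructions([^<]+)/i)

# Split the captured text into stripped, non-empty lines.
instruction_lines = match ? match[1].lines.map(&:strip).reject(&:empty?) : []

puts instruction_lines
```

Note that the pattern matches the first occurrence anywhere in the markup, so a page whose `<title>` also contains the word would match there first. The capture also stops at the next tag of any kind, including inline markup inside the instructions.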

Answer 3

You could start by testing whether the document matches at all:

# File.read closes the file handle once the contents are read
if File.read('docname.html') =~ /Instructions/
  # Parse the document to extract the instructions.
end

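I'd suggest using Hpricot to then extract the part you want. Depending on how the HTML is structured, this will be more or less difficult. If you need more specific help, please post more details about the structure.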

Fastest way to find a particular word in an XHTML document
