Question

I am trying to parse this page and extract the date that comes after

<p>From Date:

I get the error

Invalid predicate: //b[text() = '<p>From Date: ' (Nokogiri::XML::XPath::SyntaxError)

The XPath from "Inspect Element" is

/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p

Here is the code sample:

#/usr/bin/ruby

require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
noko.xpath("//b[text() = '<p>From Date: ").each do |b|
puts b.next_sibling.content.strip
end

Here is the file, china.html:



    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

    <html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        
        <title>File </title>

    
      </head>
      <body>
        
            <div id ="timelineItems">
    <H2 id="telegram1"> Title </H2>
            <p><table cellspacing="0">
    <tr>
    <td width="2%">&nbsp;</td>
    <td width="75%">
    <table cellspacing="0" cellpadding="0" class="resultsTypes">
    <tr>
    <td width="5%" class="hide">&nbsp;</td>
    <td width="70%">
    <p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
    <p>Title: <a href="http://www.bing.com" title=""><span class="bidi">Meeting in China</span></a></p>

    <p>recipient: David Ben Gurion</p>
    <p>sender: Prime Minister of Union of Burma, Rangoon</p>
    <p>  Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
    <p>From Date: 02/14/1936</p>
    <p>Link to file: <span class="bidi">תיק התכתבות  1956 ינואר</span></p>
    </td>
    </tr>
    <tr>
    <td colspan="2">
    </td>
    </tr>
    </table></td>
    <td class="actions">&nbsp;</td>
    </tr>
    </table>

    </p>
          </div>
          
    
    </body></html>

Amadan's answer (original.rb)


gives an error:

#/usr/bin/ruby

require 'Nokogiri'
noko = Nokogiri::HTML('china.html')
date = noko.at_xpath("//p[starts-with(text(),'From Date: ')]").text()

puts date

formatted = date[/From Date: (.*)/, 1]

puts formatted

Answer 1

You can't use

noko = Nokogiri::HTML('china.html')

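Nokogiri::HTML is a shortcut for Nokogiri::HTML::Document.parse. The documentation says:

.parse(string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML) {|options| ... } ⇒ Object
... string_or_io may be a String, or any object that responds to read and close, such as an IO or StringIO. ...

Although 'china.html' is a String, it is not HTML. It looks like you assumed a filename would be enough, but Nokogiri does not open anything. It only understands strings that contain markup, either HTML or XML, or an IO-type object that responds to the read method. Compare these: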

require 'nokogiri'

doc = Nokogiri::HTML('china.html')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>china.html</p></body></html>\n"

versus

doc = Nokogiri::HTML('<html><body><p>foo</p></body></html>')
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><p>foo</p></body></html>\n"

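and

require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.example.org')) # URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
doc.to_html[0..99]
# => "<!DOCTYPE html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset=\"utf-8\">\n    <met"

The last one works because OpenURI adds the ability to read URLs to URI.open, which returns an object that responds to read:

URI.open('http://www.example.org').respond_to?(:read) # => true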

Moving on to the problem:

require 'nokogiri'
require 'open-uri'

html = <<EOT
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

<html lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    <title>File </title>


  </head>
  <body>

        <div id ="timelineItems">
<H2 id="telegram1"> Title </H2>
        <p><table cellspacing="0">
<tr>
<td width="2%">&nbsp;</td>
<td width="75%">
<table cellspacing="0" cellpadding="0" class="resultsTypes">
<tr>
<td width="5%" class="hide">&nbsp;</td>
<td width="70%">
<p>Template: <span class="bidi">ארכיון בן גוריון - מסמך</span></p>
<p>Title: <a href="http://www.bing.com" title=""><span class="bidi">Meeting in China</span></a></p>

<p>recipient: David Ben Gurion</p>
<p>sender: Prime Minister of Union of Burma, Rangoon</p>
<p>  Sub collection: <span class="bidi">התכתבות > תת-חטיבה מכתב</span></p>
<p>From Date: 02/14/1936</p>
<p>Link to file: <span class="bidi">תיק התכתבות  1956 ינואר</span></p>
</td>
</tr>
<tr>
<td colspan="2">
</td>
</tr>
</table></td>
<td class="actions">&nbsp;</td>
</tr>
</table>

</p>
      </div>


</body></html>
EOT

doc = Nokogiri::HTML(html)

After parsing the document, it is easy to find the specific <p> tag by using

<table cellspacing="0" cellpadding="0" class="resultsTypes">

as a landmark:

from_date = doc.at('table.resultsTypes p[6]').text
# => "From Date: 02/14/1936"

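It looks like it's harder to pull title = "Meeting in China" and link = "bing.com" because they are on the same line.

I'm using CSS selectors to define the path to the desired text. CSS is easier to read than XPath, although XPath is more powerful and descriptive. Nokogiri lets us use either, and lets us use search or at. at is equivalent to search('some selector').first. There are also CSS-specific and XPath-specific versions of search and at, described in Nokogiri::XML::Node.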

title_link = doc.at('table.resultsTypes p[2] a')['href'] # => "http://www.bing.com"
title = doc.at('table.resultsTypes p[2] span').text # => "Meeting in China"

You are trying to use the XPath:

/html/body/div#timelineItems/table/tbody/tr/td/table.resultsTypes/tbody/tr/td/p

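However, it is not valid for the HTML you are working with.

Notice tbody in the selector. Look at the HTML immediately after either of the <table> tags: neither occurrence has a <tbody> tag, so the XPath is wrong. I suspect it was generated by your browser, which fixes up the HTML to add <tbody> as the specification requires. Nokogiri does not do that fix-up, so the selector and the HTML do not match and the search fails. So, do not rely on a selector defined by the browser, and do not trust the browser's idea of the actual HTML source.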

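Instead of using an explicit selector, it is better, easier, and smarter to look for specific waypoints in the markup and use them to navigate to the desired nodes. Here is an example that does everything above using only waypoints and a mixture of XPath and CSS: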

doc.at('//p[starts-with(., "Title:")]').text  # => "Title: Meeting in China"
title_node = doc.at('//p[starts-with(., "Title:")]')
title_url = title_node.at('a')['href'] # => "http://www.bing.com"
title = title_node.at('span').text # => "Meeting in China"

So, it is fine to mix and match CSS and XPath.

Answer 2

from_date = noko.at_xpath('//p[starts-with(text(), "From Date:")]').text()
date = from_date[/From Date: (.*)/, 1]
# => "02/14/1936"

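EDIT:

Explanation: get the first (#at_xpath) p node anywhere in the document (//p) such that ([...]) its text content (text()) starts with (starts-with(string, stringStart)) "From Date:" ("From Date:"), take its text content (#text()), and store it (=) in the variable from_date (from_date). Then, extract the first group (#[regexp, 1]) from that text (from_date) using a regular expression (/.../) that matches the literal characters "From Date: " followed by any number (*) of any characters (.), which are captured ((...)) in the first capture group that #[regexp, 1] extracts.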
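The regular expression above yields a String. If a real date object is needed, the standard library's Date.strptime can convert it. This sketch is an addition, not part of the original answer, and it assumes the month/day/year order implied by the sample value 02/14/1936.

```ruby
require 'date'

# Text as it would come back from the <p> node in the question's file.
from_date = 'From Date: 02/14/1936'

# Same extraction as above: first capture group of the regular expression.
raw = from_date[/From Date: (.*)/, 1]

# Assumed format: month/day/four-digit year.
parsed = Date.strptime(raw, '%m/%d/%Y')

puts raw
puts parsed.iso8601
```

Date.strptime raises an error on input that does not match the format, which makes unexpected values easier to spot than a silent wrong parse.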

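Also,

Amadan's answer [...] gives an error

As the Tin Man explained, I did not notice that your Nokogiri construction was broken. The line noko = Nokogiri::HTML('china.html') (which was not part of my answer) gives you a single-node document containing only the text "china.html", with no <p> nodes at all.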

How to parse a date using Nokogiri in Ruby
