How do I access the CDATA in title tags of an XML document with Nokogiri?

Asked: 2016-01-17 07:07:28

Tags: ruby xml xpath nokogiri

Here is a sample of the XML I am working with:

<item rdf:about="http://auburn.craigslist.org/cpg/5368609005.html">
<title><![CDATA[Help Wanted for Online Business]]></title>
<link>http://auburn.craigslist.org/cpg/5368609005.html</link>
<description><![CDATA[Create a safer environment for your children and WORK FROM HOME helping others do the same. 
NO Sales, No Home parties, No Tele-marketing! 1/2 computer 1/2 telephone .......No Risk Involved!........High speed Internet and telephone with long distance [...]]]></description>
<dc:date>2016-01-16T09:14:35-06:00</dc:date>
<dc:language>en-us</dc:language>
<dc:rights>&#x26;copy; 2016 &#x3C;span class="desktop"&#x3E;craigslist&#x3C;/span&#x3E;&#x3C;span class="mobile"&#x3E;CL&#x3C;/span&#x3E;</dc:rights>
<dc:source>http://auburn.craigslist.org/cpg/5368609005.html</dc:source>
<dc:title><![CDATA[Help Wanted for Online Business]]></dc:title>
<dc:type>text</dc:type>
<dcterms:issued>2016-01-16T09:14:35-06:00</dcterms:issued>
</item>

This is what I did:

    doc = Nokogiri::HTML(open(content_url)) do |config|
        config.strict.noblanks
    end
    bq = doc.xpath("//item")

When I debug it with pry, this is what it shows for the first element of bq:

[5] pry(main)> bq.first
=> #(Element:0x3fbfec8f9788 {
  name = "item",
  attributes = [ #(Attr:0x3fbfec8f195c { name = "rdf:about", value = "http://auburn.craigslist.org/cpg/5368609005.html" })],
  children = [
    #(Element:0x3fbfec8e939c { name = "title" }),
    #(Element:0x3fbfec8e0b98 { name = "link" }),
    #(Text "http://auburn.craigslist.org/cpg/5368609005.html\n"),
    #(Element:0x3fbfec8dd18c { name = "description" }),
    #(Element:0x3fbfed088e68 { name = "date", children = [ #(Text "2016-01-16T09:14:35-06:00")] }),
    #(Element:0x3fbfed079620 { name = "language", children = [ #(Text "en-us")] }),
    #(Element:0x3fbfec8d1044 { name = "rights", children = [ #(Text "&copy; 2016 <span class=\"desktop\">craigslist</span><span class=\"mobile\">CL</span>")] }),
    #(Element:0x3fbfed054050 { name = "source", children = [ #(Text "http://auburn.craigslist.org/cpg/5368609005.html")] }),
    #(Element:0x3fbfed025408 { name = "title" }),
    #(Element:0x3fbfec89d2a8 { name = "type", children = [ #(Text "text")] }),
    #(Element:0x3fbfec85e79c { name = "issued", children = [ #(Text "2016-01-16T09:14:35-06:00")] })]
  })

Note that in the Nokogiri output, all three fields that hold CDATA text are blank.

Specifically, these elements:

<title><![CDATA[Help Wanted for Online Business]]></title>
<description><![CDATA[Create a safer environment for your children and WORK FROM HOME helping others do the same. 
NO Sales, No Home parties, No Tele-marketing! 1/2 computer 1/2 telephone .......No Risk Involved!........High speed Internet and telephone with long distance [...]]]></description>
<dc:title><![CDATA[Help Wanted for Online Business]]></dc:title>

produced these results:

[5] pry(main)> bq.first
    #(Element:0x3fbfec8e939c { name = "title" }),
    #(Element:0x3fbfec8dd18c { name = "description" }),
    #(Element:0x3fbfed025408 { name = "title" }),

Why are these values blank, and how do I specifically find and extract the CDATA text?

1 answer:

Answer 0: (score: 3)

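Use Nokogiri::XML instead of Nokogiri::HTML. As the pry output in the question shows, the HTML parser leaves the CDATA-wrapped elements empty, whereas the XML parser keeps each section as a Nokogiri::XML::CDATA child node whose content you can read with text: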

2.3.0 :023 > doc = Nokogiri::XML(data)
 => #<Nokogiri::XML::Document:0x3fcd0e41a42c name="document" children=[#<Nokogiri::XML::Element:0x3fcd0e417f60 name="item" attributes=[#<Nokogiri::XML::Attr:0x3fcd0e417efc name="rdf:about" value="http://auburn.craigslist.org/cpg/5368609005.html">] children=[#<Nokogiri::XML::Text:0x3fcd0e417948 "\n">, #<Nokogiri::XML::Element:0x3fcd0e417830 name="title" children=[#<Nokogiri::XML::CDATA:0x3fcd0e417574 "Help Wanted for Online Business">]>, #<Nokogiri::XML::Text:0x3fcd0e417204 "\n">, #<Nokogiri::XML::Element:0x3fcd0e41709c name="link" children=[#<Nokogiri::XML::Text:0x3fcd0e416cf0 "http://auburn.craigslist.org/cpg/5368609005.html">]>, #<Nokogiri::XML::Text:0x3fcd0e416aac "\n">, #<Nokogiri::XML::Element:0x3fcd0e4169bc name="description" children=[#<Nokogiri::XML::CDATA:0x3fcd0e416728 "Create a safer environment for your children and WORK FROM HOME helping others do the same. \nNO Sales, No Home parties, No Tele-marketing! 1/2 computer 1/2 telephone .......No Risk Involved!........High speed Internet and telephone with long distance [...]">]>, #<Nokogiri::XML::Text:0x3fcd0e416444 "\n">, #<Nokogiri::XML::Element:0x3fcd0e41632c name="dc:date" children=[#<Nokogiri::XML::Text:0x3fcd0e413e74 "2016-01-16T09:14:35-06:00">]>, #<Nokogiri::XML::Text:0x3fcd0e4135f0 "\n">, #<Nokogiri::XML::Element:0x3fcd0e413028 name="dc:language" children=[#<Nokogiri::XML::Text:0x3fcd0e41277c "en-us">]>, #<Nokogiri::XML::Text:0x3fcd0e412588 "\n">, #<Nokogiri::XML::Element:0x3fcd0e412420 name="dc:rights" children=[#<Nokogiri::XML::Text:0x3fcd0e412128 "&copy; 2016 <span class=\"desktop\">craigslist</span><span class=\"mobile\">CL</span>">]>, #<Nokogiri::XML::Text:0x3fcd0e40fe78 "\n">, #<Nokogiri::XML::Element:0x3fcd0e40fd9c name="dc:source" children=[#<Nokogiri::XML::Text:0x3fcd0e40fae0 "http://auburn.craigslist.org/cpg/5368609005.html">]>, #<Nokogiri::XML::Text:0x3fcd0e40f7c0 "\n">, #<Nokogiri::XML::Element:0x3fcd0e40f6bc name="dc:title" children=[#<Nokogiri::XML::CDATA:0x3fcd0e40f3c4 "Help Wanted for Online Business">]>, #<Nokogiri::XML::Text:0x3fcd0e40f0cc "\n">, #<Nokogiri::XML::Element:0x3fcd0e40ef78 name="dc:type" children=[#<Nokogiri::XML::Text:0x3fcd0e40eb2c "text">]>, #<Nokogiri::XML::Text:0x3fcd0e40e6a4 "\n">, #<Nokogiri::XML::Element:0x3fcd0e40e500 name="dcterms:issued" children=[#<Nokogiri::XML::Text:0x3fcd0e40e08c "2016-01-16T09:14:35-06:00">]>, #<Nokogiri::XML::Text:0x3fcd0e407944 "\n">]>]>
2.3.0 :026 > doc.at_xpath('//title')
 => #<Nokogiri::XML::Element:0x3fcd0e417830 name="title" children=[#<Nokogiri::XML::CDATA:0x3fcd0e417574 "Help Wanted for Online Business">]>
2.3.0 :027 > doc.at_xpath('//title').text
 => "Help Wanted for Online Business"