Question

I'm trying to parse a large XML file and get the full content inside each outer XML tag, like this:

<![CDATA[Hey I'm a tag with & and other characters]]>

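What I get instead is this:

Hey I\'m a tag with &amp; and other characters

When I use Nokogiri's SAX XML parser, I only get the text, without the CDATA wrapper and with the characters escaped. This is my handler: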

  require 'nokogiri'

  class IDCollector < Nokogiri::XML::SAX::Document
    # Called for ordinary text content.
    def characters(string)
      puts string # this does not work: the CDATA markers are not printed
    end

    # Called with the content of a CDATA section.
    def cdata_block(string)
      puts string
      puts "<![CDATA[#{string}]]>"
    end
  end


Is there a way to do this with Nokogiri SAX?

Answer 1

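It's not clear what you're trying to do, but this might help clear things up.

A <![CDATA[...]]> entry isn't a tag, it's a block, and the parser handles it differently. When a block is encountered, the <![CDATA[ and ]]> are stripped off, so you only see the string inside. See "What does <![CDATA[]]> in XML mean?" for more information.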

If you're trying to create a CDATA block in XML, it can be done easily using:

doc = Nokogiri::XML(%(<string name="key"></string>))
doc.at('string') << Nokogiri::XML::CDATA.new(Nokogiri::XML::Document.new, "Hey I'm a tag with & and other characters")
doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\"><![CDATA[Hey I'm a tag with & and other characters]]></string>\n"

<< is just shorthand for adding a child node.

Trying to use inner_html doesn't do what you want, because it creates a text node as the child:

doc = Nokogiri::XML(%(<string name="key"></string>))
doc.at('string').inner_html = "Hey I'm a tag with & and other characters"
doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\">Hey I'm a tag with &amp; and other characters</string>\n"
doc.at('string').children.first.text # => "Hey I'm a tag with & and other characters"
doc.at('string').children.first.class # => Nokogiri::XML::Text

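Using inner_html causes the string to be HTML-encoded, which is the alternative way of embedding text that could contain tags. Without either the encoding or CDATA, an XML parser could get confused about what is text and what is a real tag. I've written RSS aggregators, and having to deal with incorrectly encoded embedded HTML in a feed is a pain.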

Answer 2

After looking through the documentation for a while, I think this can only be done by building new CDATA content with Nokogiri's help, like this:

  tmp = Nokogiri::XML::Document.new
  value = tmp.create_cdata(value)
  r = doc.at_xpath(PATH_TO_REPLACE)
  r.inner_html = value

How to get CDATA content with SAX
