I have the following sample XML:
<all>
<houses>
<reg info='<root><h level="2" i="1"> something </h><root>'
other="test"
something
</reg>
</houses>
</all>
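I want to parse the XML provided in the info attribute of the <reg> tag, but I don't know how to feed the content of the info attribute to Nokogiri.
This is what I have so far:
require "nokogiri"
require "open-uri"
doc = Nokogiri::HTML(URI.open(mylink))
node = doc.xpath("//houses/reg")
puts node[0]['info'].class # => String
# Content of the info attribute as a string. This is what I want to feed to Nokogiri as XML.
puts node[0]['info']
How can I do this?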
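Answer 0 (score: 3)
You need to take the text of the info attribute and unescape the HTML using the CGI class. You can then feed the string to Nokogiri::XML and it will be parsed. Something like this: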
require "nokogiri"
require "open-uri"
require "cgi"
doc = Nokogiri::HTML(URI.open("http://example.com/foo.xml"))
node = doc.xpath("//houses/reg")
info_string = CGI.unescapeHTML(node[0]['info'])
info_doc = Nokogiri::XML(info_string)
# Now you can have a Nokogiri document from that attribute.
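To make the unescaping step concrete, here is a minimal sketch that uses only the standard library. The `escaped` string is a hypothetical value, assuming the source delivers the inner markup entity-escaped (and with the inner <root> closed); it is not taken from the question.

```ruby
require "cgi"

# Hypothetical attribute value, assuming the inner markup arrives entity-escaped.
escaped = "&lt;root&gt;&lt;h level=&quot;2&quot; i=&quot;1&quot;&gt; something &lt;/h&gt;&lt;/root&gt;"

# CGI.unescapeHTML turns the entities back into literal markup characters.
unescaped = CGI.unescapeHTML(escaped)
puts unescaped

# A string that contains no entities passes through unchanged,
# so the call is harmless if the value was already decoded.
puts CGI.unescapeHTML(unescaped) == unescaped
```

If the parser has already decoded the attribute value, the extra call changes nothing, so it is safe to leave in.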
Answer 1 (score: 0)
require 'nokogiri'
xml = "<all>
<houses>
<reg info='<root><h level=\"2\" i=\"1\"> something </h><root>'
other=\"test\"
something
</reg>
</houses>
</all>"
doc = Nokogiri::HTML(xml)
node = doc.xpath('//houses/reg')
puts node[0]['info'].class #string
puts node[0]['info']
inner_xml = node[0]['info']
inner_doc = Nokogiri::XML(inner_xml)
puts inner_doc.xpath('root/h')[0].text
Answer 3 (score: 0)
Here are some things to be aware of:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<all>
<houses>
<reg info='<root><h level="2" i="1"> something </h><root>'
other="test"
something
</reg>
</houses>
</all>
EOT
doc.errors # => [#<Nokogiri::XML::SyntaxError: Unescaped '<' not allowed in attributes values>, #<Nokogiri::XML::SyntaxError: attributes construct error>, #<Nokogiri::XML::SyntaxError: Couldn't find end of Start Tag reg line 3>, #<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: root line 3 and reg>, #<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: root line 3 and houses>, #<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: houses line 2 and all>, #<Nokogiri::XML::SyntaxError: Premature end of data in tag all line 1>]
doc.at('reg')['info'] # => ""
puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <all>
# >> <houses>
# >> <reg info=""/><root><h level="2" i="1"> something </h><root>'
# >> other="test"
# >> something
# >> </root>
# >> </root>
# >> </houses>
# >> </all>
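Parsing XML should normally be done with Nokogiri::XML, because XML is a strict specification. This markup is malformed, so Nokogiri correctly flags the errors, then tries to repair the markup and continue parsing.
Using Nokogiri::HTML loosens the reins and makes the parser more lenient about what it sees; HTML is notoriously badly written, so Nokogiri tries to be more forgiving: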
doc = Nokogiri::HTML(<<EOT)
<all>
<houses>
<reg info='<root><h level="2" i="1"> something </h><root>'
other="test"
something
</reg>
</houses>
</all>
EOT
doc.errors # => [#<Nokogiri::XML::SyntaxError: Tag all invalid>, #<Nokogiri::XML::SyntaxError: Tag houses invalid>, #<Nokogiri::XML::SyntaxError: error parsing attribute name>, #<Nokogiri::XML::SyntaxError: Tag reg invalid>]
doc.at('reg')['info'] # => "<root><h level=\"2\" i=\"1\"> something </h><root>"
puts doc.to_xml
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <all>
# >> <houses>
# >> <reg info='<root><h level="2" i="1"> something </h><root>' other="test" something>
# >> </reg></houses>
# >> </all>
# >> </body></html>
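Notice that Nokogiri now returns the content of the info attribute, keeps info as an attribute of <reg> in the output, and wraps the document in <html><body> tags. To extract the repaired XML, you need to peel away a couple of layers: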
puts doc.at('all').to_xml
# >> <all>
# >> <houses>
# >> <reg info="&lt;root&gt;&lt;h level=&quot;2&quot; i=&quot;1&quot;&gt; something &lt;/h&gt;&lt;root&gt;" other="test" something="">
# >> </reg></houses>
# >> </all>
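I'm not sure whether Nokogiri's behavior has changed since the question was originally asked, but the current behavior in v1.6.7.2 handles the decoding correctly without needing CGI.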
Answer 4 (score: -1)
node[0].attr('info')
gives you the info attribute.