I have an XML file with this declaration at the top:
<?xml version="1.0" encoding="utf-16"?>
and its root element is:
<transmission xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
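A SAX parser will not parse it as is. If I manually remove the encoding part of the declaration and the attributes on transmission, the XML parses successfully. Because the file is huge, SAX is my only option. Is there any other way to parse this XML file without manually removing the encoding and the transmission attributes?
Sample code: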
require 'nokogiri'
include Nokogiri

class P < Nokogiri::XML::SAX::Document
  def initialize
  end

  def start_element(element, attributes = [])
    puts element
  end

  def cdata_block(string)
  end

  def characters(string)
  end

  def end_element(element)
    puts element
  end
end

parser = Nokogiri::XML::SAX::Parser.new(P.new())
parser.parse_file('file_dummy.xml')
Answer 0 (score: 0)
Try implementing the full set of SAX callbacks and see what you get:
require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  def cdata_block(str)
    puts "cdata_block: #{str}"
  end

  def characters(str)
    puts "characters: #{str}"
  end

  def comment(str)
    puts "comment: #{str}"
  end

  def end_element(str)
    puts "end_element: #{str}"
  end

  def end_document
    puts "end_document"
  end

  def end_element_namespace(name, prefix = nil, uri = nil)
    puts "end_element_namespace: name: #{name} prefix: #{prefix} uri: #{uri}"
  end

  def error(str)
    puts "error:#{str}"
  end

  def processing_instruction(name, content)
    puts "processing_instruction: name: #{name} content: #{content}"
  end

  def start_document
    puts "start_document"
  end

  def start_element(str, attrs = [])
    puts "start_element: #{str} attrs: #{attrs}"
  end

  def start_element_namespace(name, attrs=[], prefix=nil, uri=nil, ns=[])
    puts "start_element_namespace: name: #{name} attrs: #{attrs} prefix: #{prefix} uri: #{uri} ns: #{ns}"
  end

  def warning(str)
    puts "warning: #{str}"
  end

  def xmldecl(version, encoding, standalone)
    puts "xmldecl: version: #{version} encoding: #{encoding} standalone: #{standalone}"
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new)
parser.parse(File.open(ARGV[0]))
Save it to a script and run it with:
ruby path/to/script.rb path/to/file.xml
You should see output. For example, using the following as a simple XML file:
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
</catalog>
I get the following output:
xmldecl: version: 1.0 encoding: standalone:
start_document
start_element_namespace: name: catalog attrs: [] prefix: uri: ns: []
characters:
start_element_namespace: name: book attrs: [#<struct Nokogiri::XML::SAX::Parser::Attribute localname="id", prefix=nil, uri=nil, value="bk101">] prefix: uri: ns: []
characters:
start_element_namespace: name: author attrs: [] prefix: uri: ns: []
characters: Gambardella, Matthew
end_element_namespace: name: author prefix: uri:
characters:
start_element_namespace: name: title attrs: [] prefix: uri: ns: []
characters: XML Developer's Guide
end_element_namespace: name: title prefix: uri:
characters:
start_element_namespace: name: genre attrs: [] prefix: uri: ns: []
characters: Computer
end_element_namespace: name: genre prefix: uri:
characters:
start_element_namespace: name: price attrs: [] prefix: uri: ns: []
characters: 44.95
end_element_namespace: name: price prefix: uri:
characters:
start_element_namespace: name: publish_date attrs: [] prefix: uri: ns: []
characters: 2000-10-01
end_element_namespace: name: publish_date prefix: uri:
characters:
start_element_namespace: name: description attrs: [] prefix: uri: ns: []
characters: An in-depth look at creating applications
with XML.
end_element_namespace: name: description prefix: uri:
characters:
end_element_namespace: name: book prefix: uri:
characters:
end_element_namespace: name: catalog prefix: uri:
end_document
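Alongside the callback trace, it may help to check what the file's bytes actually look like, since a declaration that says utf-16 on a file that is really stored as UTF-8 or ASCII is one plausible reason for the parser to give up. The following is only a rough sketch using the Ruby standard library; `sniff_encoding` is a hypothetical helper name, and it looks at nothing but the first few bytes (byte order marks, or the byte pattern of a leading `<`).

```ruby
# Hypothetical helper: guess the byte-level encoding from the first bytes
# of a file. This is a heuristic, not a full encoding detector.
def sniff_encoding(path)
  head = File.binread(path, 4) || ''.b   # binread returns nil for an empty file
  bytes = head.bytes
  if bytes[0, 3] == [0xEF, 0xBB, 0xBF]
    'UTF-8 (BOM)'
  elsif bytes[0, 2] == [0xFF, 0xFE]
    'UTF-16LE (BOM)'
  elsif bytes[0, 2] == [0xFE, 0xFF]
    'UTF-16BE (BOM)'
  elsif bytes[0, 2] == [0x3C, 0x00]      # "<" followed by a zero byte
    'UTF-16LE (no BOM)'
  elsif bytes[0, 2] == [0x00, 0x3C]      # zero byte followed by "<"
    'UTF-16BE (no BOM)'
  else
    'single-byte or UTF-8 (no BOM)'
  end
end
```

If this reports a single-byte or UTF-8 result for a file whose declaration says utf-16, the declaration and the content disagree, and correcting the declaration is a reasonable thing to try.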
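Answer 1 (score: 0)
After a lot of searching, I found the answer. It builds on @thetinman's answer, although I did not take all of it. I used a sed command to replace utf-16 with utf-8 in the declaration and then parsed the file. The reason I needed the sed step is that Nokogiri has a problem with this utf-16 declaration.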
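The same substitution can be done from Ruby without loading the whole file into memory. This is a minimal sketch, not the exact command I ran: `rewrite_declared_encoding` is a hypothetical helper name, and it assumes the declaration sits within the first 256 bytes and that the content really is UTF-8 or ASCII despite the utf-16 label.

```ruby
# Hypothetical helper: copy src_path to dst_path, changing only the encoding
# named in the XML declaration. Assumes the declaration is within the first
# 256 bytes and that the bytes are really UTF-8/ASCII despite the label.
def rewrite_declared_encoding(src_path, dst_path, from: 'utf-16', to: 'utf-8')
  pattern = /encoding=(["'])#{Regexp.escape(from)}\1/i
  File.open(src_path, 'rb') do |src|
    File.open(dst_path, 'wb') do |dst|
      head = src.read(256) || ''.b
      # Rewrite the declaration, keeping whichever quote character was used.
      dst.write(head.sub(pattern) { "encoding=#{$1}#{to}#{$1}" })
      # Copy the rest of the file unchanged, one chunk at a time.
      while (chunk = src.read(64 * 1024))
        dst.write(chunk)
      end
    end
  end
end
```

The rewritten copy can then be passed to parser.parse_file exactly as in the question. If the file is genuinely stored as UTF-16, changing the label alone will not help and the bytes would need converting as well.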