Nokogiri gem does not parse a file with a SAX handler

Date: 2017-01-23 15:41:33

Tags: ruby-on-rails ruby xml nokogiri sax

I have an XML file with the header

<?xml version="1.0" encoding="utf-16"?>

and it contains

<transmission xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">

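With the SAX parser, this file does not parse. If I manually remove the encoding part of the declaration and the attributes after transmission, the XML parses successfully. Because the file is large, I can only use SAX. Is there any other way to parse this XML file without manually removing the encoding and the transmission attributes?

Sample code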

    require 'nokogiri'

    # SAX handler that prints the name of every element as it opens and closes.
    class P < Nokogiri::XML::SAX::Document
      def start_element(element, attributes = [])
        puts element
      end

      def cdata_block(string)
      end

      def characters(string)
      end

      def end_element(element)
        puts element
      end
    end

    parser = Nokogiri::XML::SAX::Parser.new(P.new)
    parser.parse_file('file_dummy.xml')

2 answers:

Answer 0 (score: 0)

Try implementing the full set of SAX methods and see what you get:

require 'nokogiri'

class MyDoc < Nokogiri::XML::SAX::Document
  def cdata_block(str)
    puts "cdata_block: #{str}"
  end

  def characters(str)
    puts "characters: #{str}"
  end

  def comment(str)
    puts "comment: #{str}"
  end

  def end_element(str)
    puts "end_element: #{str}"
  end

  def end_document
    puts "end_document"
  end

  def end_element_namespace(name, prefix = nil, uri = nil)
    puts "end_element_namespace: name: #{name} prefix: #{prefix} uri: #{uri}"
  end

  def error(str)
    puts "error:#{str}"
  end

  def processing_instruction(name, content)
    puts "processing_instruction: name: #{name} content: #{content}"
  end

  def start_document
    puts "start_document"
  end

  def start_element(str, attrs = [])
    puts "start_element: #{str} attrs: #{attrs}"
  end

  def start_element_namespace(name, attrs=[], prefix=nil, uri=nil, ns=[])
    puts "start_element_namespace: name: #{name} attrs: #{attrs} prefix: #{prefix} uri: #{uri} ns: #{ns}"
  end

  def warning(str)
    puts "warning: #{str}"
  end

  def xmldecl(version, encoding, standalone)
    puts "xmldecl: version: #{version} encoding: #{encoding} standalone: #{standalone}"
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MyDoc.new)
# The block form closes the file once parsing finishes.
File.open(ARGV[0]) { |file| parser.parse(file) }

Save it to a script and run it with:

ruby path/to/script.rb path/to/file.xml

You should see output. For example, using the following as a simple XML file:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>
</catalog>

I get the following output:

xmldecl: version: 1.0 encoding:  standalone:
start_document
start_element_namespace: name: catalog attrs: [] prefix:  uri:  ns: []
characters:

start_element_namespace: name: book attrs: [#<struct Nokogiri::XML::SAX::Parser::Attribute localname="id", prefix=nil, uri=nil, value="bk101">] prefix:  uri:  ns: []
characters:

start_element_namespace: name: author attrs: [] prefix:  uri:  ns: []
characters: Gambardella, Matthew
end_element_namespace: name: author prefix:  uri:
characters:

start_element_namespace: name: title attrs: [] prefix:  uri:  ns: []
characters: XML Developer's Guide
end_element_namespace: name: title prefix:  uri:
characters:

start_element_namespace: name: genre attrs: [] prefix:  uri:  ns: []
characters: Computer
end_element_namespace: name: genre prefix:  uri:
characters:

start_element_namespace: name: price attrs: [] prefix:  uri:  ns: []
characters: 44.95
end_element_namespace: name: price prefix:  uri:
characters:

start_element_namespace: name: publish_date attrs: [] prefix:  uri:  ns: []
characters: 2000-10-01
end_element_namespace: name: publish_date prefix:  uri:
characters:

start_element_namespace: name: description attrs: [] prefix:  uri:  ns: []
characters: An in-depth look at creating applications
      with XML.
end_element_namespace: name: description prefix:  uri:
characters:

end_element_namespace: name: book prefix:  uri:
characters:
end_element_namespace: name: catalog prefix:  uri:
end_document

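Answer 1 (score: 0)

After a lot of searching, I found the answer. It builds on @thetinman's answer, although I did not use all of it. Use the sed command to replace utf-16 with utf-8 in the XML declaration, then parse the file. The sed step is needed because Nokogiri has a problem with this utf-16 declaration.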
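
The same substitution can be done from Ruby instead of sed. The following is only a minimal sketch, not Nokogiri API: `rewrite_declared_encoding`, `HEAD_BYTES`, and `CHUNK_BYTES` are hypothetical names introduced here. It assumes the declaration sits within the first 256 bytes and that the file's bytes are not really UTF-16 (the declaration is simply wrong), which is the only case where swapping the label alone can help. The rest of the file is streamed in chunks, so a large file never has to fit in memory.

```ruby
# Hypothetical helper (not part of Nokogiri): copy src to dst, changing the
# encoding named in the XML declaration. Only the first HEAD_BYTES bytes are
# inspected; everything after that is copied through unchanged.
HEAD_BYTES = 256
CHUNK_BYTES = 64 * 1024

def rewrite_declared_encoding(src, dst, from: 'utf-16', to: 'utf-8')
  # Match encoding="utf-16" or encoding='utf-16', case-insensitively.
  pattern = /encoding=(["'])#{Regexp.escape(from)}\1/i
  replaced = false

  File.open(src, 'rb') do |input|
    File.open(dst, 'wb') do |output|
      head = input.read(HEAD_BYTES) || ''.b
      result = head.sub!(pattern) do
        quote = Regexp.last_match(1)
        "encoding=#{quote}#{to}#{quote}"
      end
      replaced = !result.nil?
      output.write(head)

      # Stream the remainder in fixed-size chunks.
      while (chunk = input.read(CHUNK_BYTES))
        output.write(chunk)
      end
    end
  end

  replaced # true if a declaration was rewritten, false if dst is a plain copy
end
```

The rewritten copy can then be handed to the parser from the question, for example with parser.parse_file on the new path. If the file begins with a UTF-16 byte order mark, the pattern will not match and the function returns false; in that case the content would need real transcoding rather than a changed label.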