Handling entities when using SAX with Nokogiri

Asked: 2014-05-17 03:10:06

Tags: ruby xml nokogiri sax

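I am trying to parse a file named JMnedict.xml, which is a dictionary of Japanese names. I hope the non-ASCII characters are not causing any parsing problems, but I cannot rule that out. The original file is linked from http://www.csse.monash.edu.au/~jwb/enamdict_doc.html, but it is very large.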

Because the XML file is so large, I am using SAX.

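When I parse the original file with the SAX-based debugging code below, it appears to work until it reaches the first entity, and then it stops working. In particular, once the first error occurs, no further element start or end events are printed.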

# sax_replication.rb
require "nokogiri"

class SaxReplication
  def self.parse(filename)
    parser = Nokogiri::XML::SAX::Parser.new(EnamdictDocument.new)
    # Block form closes the file handle once parsing finishes.
    File.open(filename) { |file| parser.parse(file) }
  end
end

class EnamdictDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attributes = [])
    puts "DEBUG: #{name} started"
  end

  def end_element(name)
    puts "DEBUG: #{name} ended"
  end

  def error(string)
    puts "ERROR: #{string}"
  end
end

if __FILE__ == $0
  filename = ARGV[0]
  SaxReplication.parse(filename)
end

This gives:

$ ruby sax_replication.rb data/JMnedict.xml | head -n20
DEBUG: JMnedict started
DEBUG: entry started
DEBUG: k_ele started
DEBUG: keb started
DEBUG: keb ended
DEBUG: k_ele ended
DEBUG: r_ele started
DEBUG: reb started
DEBUG: reb ended
DEBUG: r_ele ended
DEBUG: trans started
DEBUG: name_type started
ERROR: Entity 'given' not defined
ERROR: Entity 'fem' not defined
ERROR: Entity 'fem' not defined
ERROR: Entity 'given' not defined
ERROR: Entity 'fem' not defined
ERROR: Entity 'given' not defined
ERROR: Entity 'fem' not defined
ERROR: Entity 'surname' not defined

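I suspect this is related to SAX, because a much smaller file that I created by hand parses fine with a conventional (DOM) parser but fails in the same way with the SAX parser.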

$ ruby sax_replication.rb data/tinyJMnedict.xml
DEBUG: JMnedict started
DEBUG: entry started
DEBUG: k_ele started
DEBUG: keb started
DEBUG: keb ended
DEBUG: k_ele ended
DEBUG: r_ele started
DEBUG: reb started
DEBUG: reb ended
DEBUG: r_ele ended
DEBUG: trans started
DEBUG: name_type started
ERROR: Entity 'given' not defined

(Result using conventional XML parsing)

# conventional_xml.rb
require "nokogiri"

class ConventionalXML
  def self.parse(filename)
    text = File.read(filename)
    nokogiri = Nokogiri::XML(text)
    trans_det_xpath = '/JMnedict/entry/trans/trans_det'
    trans_det_node = nokogiri.xpath(trans_det_xpath).first
    content = trans_det_node.content
    puts content
  end
end

if __FILE__ == $0
  filename = ARGV[0]
  ConventionalXML.parse(filename)
end

$ ruby conventional_xml.rb data/tinyJMnedict.xml
Chusen

The XML file does appear to contain declarations for the entities.

Here is the data/tinyJMnedict.xml file I used; it includes the entity declarations.

<?xml version="1.0"?>
<!DOCTYPE JMnedict [
<!--
        This is the DTD of the Japanese-Multilingual Named Entity
        Dictionary file. It is based on the JMdict DTD, and carries
        many fields from it. It is used for a quick-and-dirty conversion
        of the ENAMDICT entries, plus the name entries from EDICTH.
-->
<!ELEMENT JMnedict (entry*)>
<!--                                                                   -->
<!ELEMENT entry (k_ele*, r_ele+, trans+)*>
        <!-- Entries consist of kanji elements, reading elements 
        name translation elements. Each entry must have at 
        least one reading element and one sense element. Others are optional.
        -->
<!ELEMENT k_ele (keb, ke_inf*, ke_pri*)>
        <!-- The kanji element, or in its absence, the reading element, is 
        the defining component of each entry.
        The overwhelming majority of entries will have a single kanji
        element associated with an entity name in Japanese. Where there are 
        multiple kanji elements within an entry, they will be orthographical
        variants of the same word, either using variations in okurigana, or
        alternative and equivalent kanji. Common "mis-spellings" may be 
        included, provided they are associated with appropriate information
        fields. Synonyms are not included; they may be indicated in the
        cross-reference field associated with the sense element.
        -->
<!ELEMENT keb (#PCDATA)>
        <!-- This element will contain an entity name in Japanese 
        which is written using at least one non-kana character (usually
        kanji, but can be other characters). The valid 
        characters are kanji, kana, related characters such as chouon and 
        kurikaeshi, and in exceptional cases, letters from other alphabets.
        -->
<!ELEMENT ke_inf (#PCDATA)>
        <!-- This is a coded information field related specifically to the 
        orthography of the keb, and will typically indicate some unusual
        aspect, such as okurigana irregularity.
        -->
<!ELEMENT ke_pri (#PCDATA)>
        <!-- This and the equivalent re_pri field are provided to record
        information about the relative priority of the entry, and are for
        use either by applications which want to concentrate on entries of 
        a particular priority, or to generate subset files. The reason
        both the kanji and reading elements are tagged is because on
        occasions a priority is only associated with a particular
        kanji/reading pair.
        -->
<!--                                                                   -->
<!ELEMENT r_ele (reb, re_restr*, re_inf*, re_pri*)>
        <!-- The reading element typically contains the valid readings
        of the word(s) in the kanji element using modern kanadzukai. 
        Where there are multiple reading elements, they will typically be
        alternative readings of the kanji element. In the absence of a 
        kanji element, i.e. in the case of a word or phrase written
        entirely in kana, these elements will define the entry.
        -->
<!ELEMENT reb (#PCDATA)>
        <!-- this element content is restricted to kana and related
        characters such as chouon and kurikaeshi. Kana usage will be
        consistent between the keb and reb elements; e.g. if the keb
        contains katakana, so too will the reb.
        -->
<!ELEMENT re_restr (#PCDATA)>
        <!-- This element is used to indicate when the reading only applies
        to a subset of the keb elements in the entry. In its absence, all
        readings apply to all kanji elements. The contents of this element 
        must exactly match those of one of the keb elements.
        -->
<!ELEMENT re_inf (#PCDATA)>
        <!-- General coded information pertaining to the specific reading.
        Typically it will be used to indicate some unusual aspect of 
        the reading. -->
<!ELEMENT re_pri (#PCDATA)>
        <!-- See the comment on ke_pri above. -->
<!ELEMENT trans (name_type*, trans_det*)>
        <!-- The trans element will record the translational equivalent
        of the Japanese name, plus other related information. 
        -->
<!ELEMENT name_type (#PCDATA)>
        <!-- The type of name, recorded in the appropriate entity codes.
        -->
<!ELEMENT trans_det (#PCDATA)>
        <!-- The actual translations of the name, usually as a transcription
        into the target language.
        -->
<!ATTLIST trans_det xml:lang CDATA #IMPLIED>
        <!-- The xml:lang attribute defines the target language of the
        translated name. It will be coded using the three-letter language 
        code from the ISO 639-2 standard. When absent, the value "eng" 
        (i.e. English) is the default value. The bibliographic (B) codes
        are used.-->
<!-- The following entity codes are used for common elements within the
various information fields.
-->
<!ENTITY surname "family or surname">
<!ENTITY place "place name">
<!ENTITY unclass "unclassified name">
<!ENTITY company "company name">
<!ENTITY product "product name">
<!ENTITY masc "male given name or forename">
<!ENTITY fem "female given name or forename">
<!ENTITY person "full name of a particular person">
<!ENTITY given "given name or forename, gender not specified">
<!ENTITY station "railway station">
<!ENTITY organization "organization name">
<!ENTITY oik "old or irregular kana form">
]>
<!-- JMnedict created: 2014-05-05 -->
<JMnedict>
<entry>
<k_ele>
<keb>ゝ泉</keb>
</k_ele>
<r_ele>
<reb>ちゅせん</reb>
</r_ele>
<trans>
<name_type>&given;</name_type>
<trans_det>Chusen</trans_det>
</trans>
</entry>
</JMnedict>

The entities in the name_type elements carry some useful information, but I can ignore them if necessary.

How can I keep entities from causing errors when parsing with SAX in Nokogiri?

0 answers:

No answers yet