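I am trying to parse a file named JMnedict.xml, which is a dictionary of Japanese names. I hope the non-ASCII characters are not causing any parsing problems, but I cannot rule that out. The original file is linked from http://www.csse.monash.edu.au/~jwb/enamdict_doc.html, but it is very large.
Because the XML file is so large, I am using SAX.
When I parse the original file with the SAX debugging code below, it appears to work until it reaches the first entity, and then it stops working. In particular, once the first error occurs, no further element start or end events are printed.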
# sax_replication.rb
require "nokogiri"

class SaxReplication
  def self.parse(filename)
    parser = Nokogiri::XML::SAX::Parser.new(EnamdictDocument.new)
    # Block form closes the file handle once parsing finishes.
    File.open(filename) { |file| parser.parse(file) }
  end
end

# SAX handler that only logs element boundaries and parser errors.
class EnamdictDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attributes = [])
    puts "DEBUG: #{name} started"
  end

  def end_element(name)
    puts "DEBUG: #{name} ended"
  end

  def error(string)
    puts "ERROR: #{string}"
  end
end

if __FILE__ == $0
  filename = ARGV[0]
  SaxReplication.parse(filename)
end
This gives:
$ ruby sax_replication.rb data/JMnedict.xml | head -n20
DEBUG: JMnedict started
DEBUG: entry started
DEBUG: k_ele started
DEBUG: keb started
DEBUG: keb ended
DEBUG: k_ele ended
DEBUG: r_ele started
DEBUG: reb started
DEBUG: reb ended
DEBUG: r_ele ended
DEBUG: trans started
DEBUG: name_type started
ERROR: Entity 'given' not defined
ERROR: Entity 'fem' not defined
ERROR: Entity 'fem' not defined
ERROR: Entity 'given' not defined
ERROR: Entity 'fem' not defined
ERROR: Entity 'given' not defined
ERROR: Entity 'fem' not defined
ERROR: Entity 'surname' not defined
I suspect the problem is related to SAX, because a much smaller, hand-made file parses fine with a conventional parser but not with the SAX parser:
$ ruby sax_replication.rb data/tinyJMnedict.xml
DEBUG: JMnedict started
DEBUG: entry started
DEBUG: k_ele started
DEBUG: keb started
DEBUG: keb ended
DEBUG: k_ele ended
DEBUG: r_ele started
DEBUG: reb started
DEBUG: reb ended
DEBUG: r_ele ended
DEBUG: trans started
DEBUG: name_type started
ERROR: Entity 'given' not defined
Using conventional XML parsing instead:
# conventional_xml.rb
require "nokogiri"

class ConventionalXML
  def self.parse(filename)
    text = File.read(filename)
    nokogiri = Nokogiri::XML(text)
    trans_det_xpath = '/JMnedict/entry/trans/trans_det'
    trans_det_node = nokogiri.xpath(trans_det_xpath).first
    content = trans_det_node.content
    puts content
  end
end

if __FILE__ == $0
  filename = ARGV[0]
  ConventionalXML.parse(filename)
end
$ ruby conventional_xml.rb data/tinyJMnedict.xml
Chusen
The XML file does seem to contain definitions for the entities.
Here is the data/tinyJMnedict.xml file I used, which includes the entity definitions:
<?xml version="1.0"?>
<!DOCTYPE JMnedict [
<!--
This is the DTD of the Japanese-Multilingual Named Entity
Dictionary file. It is based on the JMdict DTD, and carries
many fields from it. It is used for a quick-and-dirty conversion
of the ENAMDICT entries, plus the name entries from EDICTH.
-->
<!ELEMENT JMnedict (entry*)>
<!-- -->
<!ELEMENT entry (k_ele*, r_ele+, trans+)*>
<!-- Entries consist of kanji elements, reading elements
name translation elements. Each entry must have at
least one reading element and one sense element. Others are optional.
-->
<!ELEMENT k_ele (keb, ke_inf*, ke_pri*)>
<!-- The kanji element, or in its absence, the reading element, is
the defining component of each entry.
The overwhelming majority of entries will have a single kanji
element associated with an entity name in Japanese. Where there are
multiple kanji elements within an entry, they will be orthographical
variants of the same word, either using variations in okurigana, or
alternative and equivalent kanji. Common "mis-spellings" may be
included, provided they are associated with appropriate information
fields. Synonyms are not included; they may be indicated in the
cross-reference field associated with the sense element.
-->
<!ELEMENT keb (#PCDATA)>
<!-- This element will contain an entity name in Japanese
which is written using at least one non-kana character (usually
kanji, but can be other characters). The valid
characters are kanji, kana, related characters such as chouon and
kurikaeshi, and in exceptional cases, letters from other alphabets.
-->
<!ELEMENT ke_inf (#PCDATA)>
<!-- This is a coded information field related specifically to the
orthography of the keb, and will typically indicate some unusual
aspect, such as okurigana irregularity.
-->
<!ELEMENT ke_pri (#PCDATA)>
<!-- This and the equivalent re_pri field are provided to record
information about the relative priority of the entry, and are for
use either by applications which want to concentrate on entries of
a particular priority, or to generate subset files. The reason
both the kanji and reading elements are tagged is because on
occasions a priority is only associated with a particular
kanji/reading pair.
-->
<!-- -->
<!ELEMENT r_ele (reb, re_restr*, re_inf*, re_pri*)>
<!-- The reading element typically contains the valid readings
of the word(s) in the kanji element using modern kanadzukai.
Where there are multiple reading elements, they will typically be
alternative readings of the kanji element. In the absence of a
kanji element, i.e. in the case of a word or phrase written
entirely in kana, these elements will define the entry.
-->
<!ELEMENT reb (#PCDATA)>
<!-- this element content is restricted to kana and related
characters such as chouon and kurikaeshi. Kana usage will be
consistent between the keb and reb elements; e.g. if the keb
contains katakana, so too will the reb.
-->
<!ELEMENT re_restr (#PCDATA)>
<!-- This element is used to indicate when the reading only applies
to a subset of the keb elements in the entry. In its absence, all
readings apply to all kanji elements. The contents of this element
must exactly match those of one of the keb elements.
-->
<!ELEMENT re_inf (#PCDATA)>
<!-- General coded information pertaining to the specific reading.
Typically it will be used to indicate some unusual aspect of
the reading. -->
<!ELEMENT re_pri (#PCDATA)>
<!-- See the comment on ke_pri above. -->
<!ELEMENT trans (name_type*, trans_det*)>
<!-- The trans element will record the translational equivalent
of the Japanese name, plus other related information.
-->
<!ELEMENT name_type (#PCDATA)>
<!-- The type of name, recorded in the appropriate entity codes.
-->
<!ELEMENT trans_det (#PCDATA)>
<!-- The actual translations of the name, usually as a transcription
into the target language.
-->
<!ATTLIST trans_det xml:lang CDATA #IMPLIED>
<!-- The xml:lang attribute defines the target language of the
translated name. It will be coded using the three-letter language
code from the ISO 639-2 standard. When absent, the value "eng"
(i.e. English) is the default value. The bibliographic (B) codes
are used.-->
<!-- The following entity codes are used for common elements within the
various information fields.
-->
<!ENTITY surname "family or surname">
<!ENTITY place "place name">
<!ENTITY unclass "unclassified name">
<!ENTITY company "company name">
<!ENTITY product "product name">
<!ENTITY masc "male given name or forename">
<!ENTITY fem "female given name or forename">
<!ENTITY person "full name of a particular person">
<!ENTITY given "given name or forename, gender not specified">
<!ENTITY station "railway station">
<!ENTITY organization "organization name">
<!ENTITY oik "old or irregular kana form">
]>
<!-- JMnedict created: 2014-05-05 -->
<JMnedict>
<entry>
<k_ele>
<keb>ゝ泉</keb>
</k_ele>
<r_ele>
<reb>ちゅせん</reb>
</r_ele>
<trans>
<name_type>&given;</name_type>
<trans_det>Chusen</trans_det>
</trans>
</entry>
</JMnedict>
The entities in the name_type elements carry some useful information, but I can ignore them if necessary.
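If the parser cannot be made to expand the entities, one possible fallback is to read the `<!ENTITY ...>` declarations into a Hash and look the names up by hand. The following is only a minimal sketch using plain Ruby: `ENTITY_DECL` and `read_entities` are hypothetical names, and the regular expression assumes every declaration is a simple double-quoted internal entity on one line, as in the DTD above.

```ruby
# Hypothetical helper (not part of Nokogiri): collects simple
# <!ENTITY name "value"> declarations into a Hash.
# Assumption: values are double-quoted and contain no nested quotes.
ENTITY_DECL = /<!ENTITY\s+([\w.-]+)\s+"([^"]*)"\s*>/

def read_entities(dtd_text)
  # scan returns [name, value] pairs, which to_h turns into a Hash.
  dtd_text.scan(ENTITY_DECL).to_h
end

sample = <<~DTD
  <!ENTITY surname "family or surname">
  <!ENTITY given "given name or forename, gender not specified">
DTD

entities = read_entities(sample)
puts entities["given"]
```

For the full JMnedict.xml it would probably be enough to read only the lines up to the closing `]>` of the DOCTYPE rather than the whole file.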
How can I avoid these entity errors when parsing with SAX in Nokogiri?