Why is Nokogiri truncating this element?

Asked: 2011-04-07 00:15:34

Tags: ruby nokogiri saxparser

I'm parsing an XML file with Nokogiri and Ruby 1.9.2. Everything seems to work fine until I get to the Descriptions (below), where the text is being truncated. The input text is:

<Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>

But instead, I get:

g. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.

Note that it starts at "g.", more than halfway through the text.

Here is the complete XML file:

<?xml version="1.0" encoding="utf-8"?>
<Hotel>
  <HotelID>1040900</HotelID>
  <HotelFileName>Copthorne_Hotel_Aberdeen</HotelFileName>
  <HotelName>Copthorne Hotel Aberdeen</HotelName>
  <CityID>10</CityID>
  <CityFileName>Aberdeen</CityFileName>
  <CityName>Aberdeen</CityName>
  <CountryCode>GB</CountryCode>
  <CountryFileName>United_Kingdom</CountryFileName>
  <CountryName>United Kingdom</CountryName>
  <StarRating>4</StarRating>
  <Latitude>57.146068572998</Latitude>
  <Longitude>-2.111680030823</Longitude>
  <Popularity>1</Popularity>
  <Address>122 Huntly Street</Address>
  <CurrencyCode>GBP</CurrencyCode>
  <LowRate>36.8354</LowRate>
  <Facilities>1|2|3|5|6|8|10|11|15|17|18|19|20|22|27|29|30|34|36|39|40|41|43|45|47|49|51|53|55|56|60|62|140|154|209</Facilities>
  <NumberOfReviews>239</NumberOfReviews>
  <OverallRating>3.95</OverallRating>
  <CleanlinessRating>3.98</CleanlinessRating>
  <ServiceRating>3.98</ServiceRating>
  <FacilitiesRating>3.83</FacilitiesRating>
  <LocationRating>4.06</LocationRating>
  <DiningRating>3.93</DiningRating>
  <RoomsRating>3.68</RoomsRating>
  <PropertyType>0</PropertyType>
  <ChainID>92</ChainID>
  <Checkin>14</Checkin>
  <Checkout>12</Checkout>
  <Images>
    <Image>19305754</Image>
    <Image>19305755</Image>
    <Image>19305756</Image>
    <Image>19305757</Image>
    <Image>19305758</Image>
    <Image>19305759</Image>
    <Image>19305760</Image>
    <Image>19305761</Image>
    <Image>19305762</Image>
    <Image>19305763</Image>
    <Image>19305764</Image>
    <Image>19305765</Image>
    <Image>19305766</Image>
    <Image>19305767</Image>
    <Image>37102984</Image>
  </Images>
  <Descriptions>
    <Description>
      <Name>General Description</Name>
      <Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>
    </Description>
    <Description>
      <Name>LocationDescription</Name>
      <Value>Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.</Value>
    </Description>
  </Descriptions>
</Hotel>

Here is my Ruby program:

require 'rubygems'
require 'nokogiri'
require 'ap'
include Nokogiri

class Hotel < Nokogiri::XML::SAX::Document

  def initialize
    @h = {}
    @h["Images"] = Array.new([])
    @h["Descriptions"] = Array.new([])
    @desc = {}
  end

  def end_document
    ap @h
    puts "Finished..."
  end

  def start_element(element, attributes = [])
    @element = element

    @desc = {} if element == "Description"
  end

  def end_element(element, attributes = [])
    @h["Images"] << @characters if element == "Image"
    @desc["Name"] = @characters if element == "Name"
    if element == "Value"
      @desc["Value"] = @characters
      @h["Descriptions"] << @desc
    end

    @h[element] = @characters unless %w(Images Image Descriptions Description Hotel Name Value).include? element
  end

  def characters(string)
    @characters = string
  end
end

# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(Hotel.new)

# Feed the parser some XML
parser.parse(File.open("/Users/cbmeeks/Projects/shared/data/text/HotelDatabase_EN/00/1040900.xml", 'rb'))

Thanks.

2 answers:

Answer 0: (score: 0)

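I stripped down the XML, since it had a lot of nodes that aren't needed to examine the problem. Here is a sample of how I would get at the text: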

#!/usr/bin/env ruby
# encoding: UTF-8

xml =<<EOT
<?xml version="1.0" encoding="utf-8"?>
<Hotel>
  <Descriptions>
    <Description>
      <Name>General Description</Name>
      <Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>
    </Description>
    <Description>
      <Name>LocationDescription</Name>
      <Value>Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.</Value>
    </Description>
  </Descriptions>
</Hotel>
EOT

require 'nokogiri'

doc = Nokogiri::XML(xml)
puts doc.search('Value').map{ |n| n.text }

Sample output:

  

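The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.
Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.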

This deliberately goes after only the Value nodes. Modifying the sample to grab the Image nodes is just as simple.

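Now, a couple of questions: Why are you using SAX mode? Is the incoming XML bigger than what will reasonably fit into the host's RAM? If not, use the DOM, because it is much easier to work with.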

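When I first ran it, Ruby told me invalid multibyte char (US-ASCII), which means it didn't like something in the XML. I fixed that by adding the # encoding line. I'm using Ruby 1.9.2, which makes dealing with such things easier.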

I'm using CSS accessors for the search. Nokogiri supports both XPath and CSS, so you can indulge your XML-parsing heart's desires.

Answer 1: (score: 0)

I ran into a similar problem, and here is the actual explanation. This method in the question's handler:

def characters(string)
    @characters = string
end

should actually look like this:

def start_element(element, attributes = [])
  #...(other stuff)...

  # Reset/initialize @characters
  @characters = ""
end

def characters(string)
  @characters += string
end

The rationale is that a tag's contents can actually be split across multiple text nodes, as described here: http://nokogiri.org/Nokogiri/XML/SAX/Document.html

  

"This method might be called multiple times given one contiguous string of characters."

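Only the last segment of the text body is captured because every time a text node is encountered (that is, every time the characters method is called), it replaces the contents of @characters instead of appending to them.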
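To see the difference without a parser in the loop, here is a minimal sketch that drives two handlers by hand. It does not use Nokogiri at all: OverwritingHandler, AppendingHandler, and the simulate helper are hypothetical names invented for this illustration, and the chunk boundary is made up to mimic a parser delivering one element's text in several pieces.

```ruby
# Hypothetical handlers that only mimic the SAX callback names.
# Nothing here calls Nokogiri; the callbacks are invoked by hand.
class OverwritingHandler
  attr_reader :values

  def initialize
    @values = []
  end

  # Reset the buffer whenever a new element starts.
  def start_element(name, attributes = [])
    @characters = ""
  end

  # Bug: each call replaces whatever was collected so far.
  def characters(string)
    @characters = string
  end

  def end_element(name)
    @values << @characters if name == "Value"
  end
end

class AppendingHandler < OverwritingHandler
  # Fix: accumulate every chunk delivered for the current element.
  def characters(string)
    @characters += string
  end
end

# Pretend to be a parser that delivers one element's text in chunks.
def simulate(handler, chunks)
  handler.start_element("Value")
  chunks.each { |chunk| handler.characters(chunk) }
  handler.end_element("Value")
  handler.values
end

# Made-up chunk boundary, chosen to resemble the truncation in the question.
chunks = ["The hotel offers a somewhat formal settin",
          "g. For something more laid-back, try the hotel bar."]

overwritten = simulate(OverwritingHandler.new, chunks)
appended    = simulate(AppendingHandler.new, chunks)

puts overwritten.first  # only the last chunk survives
puts appended.first     # the full text, both chunks joined
```

With the overwriting handler, the stored value starts at "g.", which is the same symptom as in the question; the appending handler keeps the whole string.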