Converting a Rake task that uses Nokogiri::XML to use Nokogiri::XML::SAX

Asked: 2013-12-30 16:33:00

Tags: ruby xml rake nokogiri sax

I have a rake task that runs fine on my local machine, but after deploying my application to a VPS it no longer lets me run the task.

I run the task using:

RAILS_ENV=production bundle exec rake db:insert_properties

The output I get is:

(in /home/deployer/apps/nsrosu/releases/20131230151646)
Killed

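Does anyone know why this would happen? I have double- and triple-checked that the XML file I pull the data from for the rake task does exist in the correct directory.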

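I have also tried pulling the file from an external source stored elsewhere, rather than using the copy stored on the server, but Nokogiri says the file does not exist when I try it that way. A fix for either of these problems would be great :)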

Here is the rake task, in case it helps answer any questions:

# SET RAKE TASK NAMESPACE
namespace :db do
# RAKE TASK DESCRIPTION
desc "Fetch property information and insert it into the database"

# RAKE TASK NAME    
task :insert_properties => :environment do

    # REQUIRE LIBRARIES
    require 'nokogiri'
    require 'open-uri'

    # OPEN THE XML FILE
    mits_feed = File.open("app/assets/xml/mits.xml")

    # PARSE THE XML DOCUMENT
    doc = Nokogiri::XML(mits_feed)

    # FIND PROPERTIES OWNED BY NORTHSTEPPE AND CYCLE THROUGH THEM
    doc.xpath("//Property[PropertyID/Identification/@OrganizationName = 'northsteppe' ]").each do |property|

        # SET UP EMPTY IMAGES ARRAY
        @images =[]

        # INSERT EACH IMAGE INTO THE IMAGES ARRAY
        property.xpath("File").each do |image|
            @images << image.at_xpath("Src/text()").to_s
        end

        # SET UP EMPTY AMENITIES ARRAY
        @amenities = []

        # INSERT EACH AMENITY DESCRIPTION INTO THE AMENITIES ARRAY
        property.xpath("ILS_Unit/Amenity").each do |image|
            @amenities << image.at_xpath("Description/text()").to_s
        end

        # GATHER EACH PROPERTY'S INFORMATION
        information = {
            "street_address" => property.at_xpath("PropertyID/Address/AddressLine1/text()").to_s,
            "city" => property.at_xpath("PropertyID/Address/City/text()").to_s,
            "zipcode" => property.at_xpath("PropertyID/Address/PostalCode/text()").to_s,
            "short_description" => property.at_xpath("PropertyID/MarketingName/text()").to_s,
            "long_description" => property.at_xpath("Information/LongDescription/text()").to_s,
            "rent" => property.at_xpath("Information/Rents/StandardRent/text()").to_s,
            "application_fee" => property.at_xpath("Fee/ApplicationFee/text()").to_s,
            "bedrooms" => property.at_xpath("Floorplan/Room[@RoomType='Bedroom']/Count/text()").to_s,
            "bathrooms" => property.at_xpath("Floorplan/Room[@RoomType='Bathroom']/Count/text()").to_s,
            "vacancy_status" => property.at_xpath("ILS_Unit/Availability/VacancyClass/text()").to_s,
            "month_available" => property.at_xpath("ILS_Unit/Availability/MadeReadyDate/@Month").to_s,
            "latitude" => property.at_xpath("ILS_Identification/Latitude/text()").to_s,
            "longitude" => property.at_xpath("ILS_Identification/Longitude/text()").to_s,
            "images" => @images,
            "amenities" => @amenities
        }

        # SHOW RAW DATA IN TERMINAL TO MAKE SURE EVERYTHING IS WORKING
        p information


        # CREATE NEW PROPERTY WITH INFORMATION HASH CREATED ABOVE
        # (create! raises on failure, so use create and check persisted?)
        if Property.create(information).persisted?
            puts "yay!"
        else
            puts "oh no! this sucks!"
        end

    end # ENDS XPATH EACH LOOP

end # ENDS INSERT_PROPERTIES RAKE TASK

end # ENDS NAMESPACE DECLARATION

================================ UPDATE ================================

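So it seems the best way to do this is to run it through a SAX system. SAXMachine is ready to work with Nokogiri, but the documentation for both technologies is pretty poor. I would appreciate some guidance on how to set up a task that accomplishes the same thing as the one above, but uses SAXMachine. Please :)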

I have posted a sample XML entry below:

<Property IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
<PropertyID>
  <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
  <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>
  <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
  <WebSite>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</WebSite>
  <Address AddressType="property">
    <Description>Address of Available Listing</Description>
    <AddressLine1>1689 N 4th St </AddressLine1>
    <City>Columbus</City>
    <State>OH</State>
    <PostalCode>43201</PostalCode>
    <Country>US</Country>
  </Address>
  <Phone PhoneType="office">
    <PhoneNumber>(614) 299-4110</PhoneNumber>
  </Phone>
  <Email>northsteppe.nsr@gmail.com</Email>
</PropertyID>
<ILS_Identification ILS_IdentificationType="Apartment" RentalType="Market Rate">
  <Latitude>39.997694</Latitude>
  <Longitude>-82.99903</Longitude>
  <LastUpdate Month="11" Day="11" Year="2013"/>
</ILS_Identification>
<Information>
  <StructureType>Standard</StructureType>
  <UnitCount>1</UnitCount>
  <ShortDescription>Spacious House Central Campus OSU, available fall</ShortDescription>
  <LongDescription>One of our favorites! This great house is perfect for students or a single family. With huge living and sleeping rooms, there is plenty of space. The kitchen is totally modernized with new appliances, and the bathroom has been updated. Natural woodwork and brick accents are seen within the house, and the decorative mantles. Ceiling fans and mini-blinds are included, as well as a FREE stack washer and dryer. The front and side deck. On site parking available.</LongDescription>
  <Rents>
    <StandardRent>2000.00</StandardRent>
  </Rents>
  <PropertyAvailabilityURL>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</PropertyAvailabilityURL>
</Information>
<Fee>
  <ProrateType>Standard</ProrateType>
  <LateType>Standard</LateType>
  <LatePercent>0</LatePercent>
  <LateMinFee>0</LateMinFee>
  <LateFeePerDay>0</LateFeePerDay>
  <NonRefundableHoldFee>0</NonRefundableHoldFee>
  <AdminFee>0</AdminFee>
  <ApplicationFee>30.00</ApplicationFee>
  <BrokerFee>0</BrokerFee>
</Fee>
<Deposit DepositType="Security Deposit">
  <Amount AmountType="Actual">
    <ValueRange Exact="2000.00" Currency="USD"/>
  </Amount>
</Deposit>
<Policy>
  <Pet Allowed="false"/>
</Policy>
<Phase IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <Description/>
  <UnitCount>1</UnitCount>
  <RentableUnits>1</RentableUnits>
  <TotalSquareFeet>0</TotalSquareFeet>
  <RentableSquareFeet>0</RentableSquareFeet>
</Phase>
<Building IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <Description/>
  <UnitCount>1</UnitCount>
  <SquareFeet>0</SquareFeet>
</Building>
<Floorplan IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Name/>
  <UnitCount>1</UnitCount>
  <Room RoomType="Bedroom">
    <Count>4</Count>
    <Comment/>
  </Room>
  <Room RoomType="Bathroom">
    <Count>1</Count>
    <Comment/>
  </Room>
  <SquareFeet Min="0" Max="0"/>
  <MarketRent Min="2000" Max="2000"/>
  <EffectiveRent Min="2000" Max="2000"/>
</Floorplan>
<ILS_Unit IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <Units>
    <Unit>
      <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="UL Portfolio"/>
      <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
      <UnitBedrooms>4</UnitBedrooms>
      <UnitBathrooms>1.0</UnitBathrooms>
      <MinSquareFeet>0</MinSquareFeet>
      <MaxSquareFeet>0</MaxSquareFeet>
      <SquareFootType>internal</SquareFootType>
      <UnitRent>2000.00</UnitRent>
      <MarketRent>2000.00</MarketRent>
      <Address AddressType="property">
        <AddressLine1>1689 N 4th St </AddressLine1>
        <City>Columbus</City>
        <PostalCode>43201</PostalCode>
        <Country>US</Country>
      </Address>
    </Unit>
  </Units>
  <Availability>
    <VacateDate Month="7" Day="23" Year="2014"/>
    <VacancyClass>Unoccupied</VacancyClass>
    <MadeReadyDate Month="7" Day="23" Year="2014"/>
  </Availability>
  <Amenity AmenityType="Other">
    <Description>All new stainless steel appliances!  Refinished hardwood floors</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Ceramic tile</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Ceiling fans</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>Wrap-around porch</Description>
  </Amenity>
  <Amenity AmenityType="Dryer">
    <Description>Free Washer and Dryer</Description>
  </Amenity>
  <Amenity AmenityType="Washer">
    <Description>Free Washer and Dryer</Description>
  </Amenity>
  <Amenity AmenityType="Other">
    <Description>off-street parking available</Description>
  </Amenity>
</ILS_Unit>
<File Active="true" FileID="820982141">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg</Src>
  <Width>360</Width>
  <Height>300</Height>
  <Rank>1</Rank>
</File>
<File Active="true" FileID="820982145">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/84e1be40-96fd-4717-b75d-09b39231a762/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>2</Rank>
</File>
<File Active="true" FileID="820982149">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/cd419635-c37f-4676-a43e-c72671a2a748/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>3</Rank>
</File>
<File Active="true" FileID="820982152">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/6b68dbd5-2cde-477c-99d7-3ca33f03cce8/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>4</Rank>
</File>
<File Active="true" FileID="820982155">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/17b6c7c0-686c-4e46-865b-11d80744354a/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>5</Rank>
</File>
<File Active="true" FileID="820982157">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/3545ac8b-471f-404a-94b2-fcd00dd16e25/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>6</Rank>
</File>
<File Active="true" FileID="820982160">
  <FileType>Photo</FileType>
  <Description>Unit Photo</Description>
  <Name/>
  <Caption/>
  <Format>image/jpeg</Format>
  <Src>http://pa.cdn.appfolio.com/northsteppe/images/02471172-2183-4bf1-a3d7-33415f902c1c/medium.jpg</Src>
  <Width>350</Width>
  <Height>265</Height>
  <Rank>7</Rank>
</File>
</Property>

2 Answers:

Answer 0 (score: 2)

You have probably exceeded the resource limits allocated to your VPS, and your task was killed as a result.
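One way to check whether memory is the culprit is to log the process's resident memory before and after the parse step. The helper below is a minimal sketch that assumes a Linux VPS exposing /proc; the method name `resident_memory_kb` is made up for illustration and is not part of Rake or Nokogiri.

```ruby
# Hypothetical helper: returns the resident set size of the current
# process in kilobytes, or nil when /proc is unavailable (e.g. on macOS).
def resident_memory_kb
  status = File.read('/proc/self/status')
  match = status.match(/^VmRSS:\s+(\d+)\s+kB/)
  match && match[1].to_i
rescue SystemCallError
  # /proc/self/status could not be read on this platform.
  nil
end

puts "RSS now: #{resident_memory_kb.inspect} kB"
```

Calling it just before and just after `Nokogiri::XML(mits_feed)` in the task should show roughly how much memory the DOM build costs on a machine where the task survives.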

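Options for improving the memory footprint of the XML-reading part of your rake task include using a SAX or pull parser instead of loading the whole file into memory. For details, see "How can I read a large XML file in Ruby with libxml-ruby?".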

Answer 1 (score: 1)

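Here is the start of a conversion; it should be enough to get you going. Note that it is untested, and it has been a long time since I wrote SAX code, so be careful.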

The first part is a cleanup of the original code, so that it looks more like how I'd write DOM code:

require 'nokogiri'
require 'open-uri'

# doc = Nokogiri::XML(File.open("app/assets/xml/mits.xml"))

# doc.xpath("//Property[PropertyID/Identification/@OrganizationName = 'northsteppe']").each do |property|

#   images = property.xpath("File").map { |image|
#     image.at_xpath("Src/text()").to_s 
#   }

#   amenities = property.xpath("ILS_Unit/Amenity").map { |image|
#     image.at_xpath("Description/text()").to_s 
#   }

#   information = {
#     "street_address"    => property.at_xpath("PropertyID/Address/AddressLine1/text()").to_s,
#     "city"              => property.at_xpath("PropertyID/Address/City/text()").to_s,
#     "zipcode"           => property.at_xpath("PropertyID/Address/PostalCode/text()").to_s,
#     "short_description" => property.at_xpath("PropertyID/MarketingName/text()").to_s,
#     "long_description"  => property.at_xpath("Information/LongDescription/text()").to_s,
#     "rent"              => property.at_xpath("Information/Rents/StandardRent/text()").to_s,
#     "application_fee"   => property.at_xpath("Fee/ApplicationFee/text()").to_s,
#     "bedrooms"          => property.at_xpath("Floorplan/Room[@RoomType='Bedroom']/Count/text()").to_s,
#     "bathrooms"         => property.at_xpath("Floorplan/Room[@RoomType='Bathroom']/Count/text()").to_s,
#     "vacancy_status"    => property.at_xpath("ILS_Unit/Availability/VacancyClass/text()").to_s,
#     "month_available"   => property.at_xpath("ILS_Unit/Availability/MadeReadyDate/@Month").to_s,
#     "latitude"          => property.at_xpath("ILS_Identification/Latitude/text()").to_s,
#     "longitude"         => property.at_xpath("ILS_Identification/Longitude/text()").to_s,
#     "images"            => images,
#     "amenities"         => amenities
#   }

#   p information


#   if Property.create!(information)
#     puts "yay!"
#   else
#     puts "oh no! this sucks!"
#   end

# end

Here is the start of the SAX code:

class MitsDocument < Nokogiri::XML::SAX::Document

I define some class variables to keep track of images and amenities:

  @@images = []
  @@amenities = []

Every time Nokogiri enters a tag, it calls start_element:

  def start_element(tag_name, attributes=[])

    # Nokogiri passes the attributes as a list of [name, value] pairs
    tag_attributes = Hash[attributes]

    # set up some flags to track the current state...
    @in_property                 = true if (tag_name == 'Property')
    @in_property_id              = true if (tag_name == 'PropertyID')

    @in_identification           = true if (tag_name == 'Identification')
    @organization_is_northsteppe = true if (tag_attributes['OrganizationName'] == 'northsteppe')

    @in_file                     = true if (tag_name == 'File')
    @in_source                   = true if (tag_name == 'Src')

    @in_ils_unit                 = true if (tag_name == 'ILS_Unit')
    @in_amentiy                  = true if (tag_name == 'Amenity')
    @in_description              = true if (tag_name == 'Description')

  end

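characters is called when a text node is encountered. If Nokogiri has descended far enough, which we can check by testing a combination of the flags, the text is pushed onto the appropriate array: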

  def characters(str)
    if [@in_file, @in_source].all?
      @@images << str
    end

    if [@in_ils_unit, @in_amentiy, @in_description].all?
      @@amenities << str
    end
  end

When Nokogiri exits a node, it calls end_element with the tag name:

  def end_element(tag_name)
    @in_property                 = false if (tag_name == 'Property')
    @in_property_id              = false if (tag_name == 'PropertyID')

    @in_identification           = false if (tag_name == 'Identification')
    @organization_is_northsteppe = false if (tag_name == 'Identification')

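When Nokogiri exits certain tags, you should do something with the results aggregated from their child tags. Here is how to handle the class variables being tracked: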

    if (tag_name == 'File')

      # do something with @@images

      @in_file = false 
    end
    @in_source = false if (tag_name == 'Src')

    if (tag_name == 'ILS_Unit')

      # do something with @@amenities

      @in_ils_unit = false 
    end
    @in_amentiy     = false if (tag_name == 'Amenity')
    @in_description = false if (tag_name == 'Description')

  end

This is where you would clean up database connections or files, or store content once the document has ended:

  def end_document
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MitsDocument.new)

# Feed the parser some XML
parser.parse(File.open("app/assets/xml/mits.xml"))

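It's late and I'm tired, so this might not be right, but it looks like a start. You'll need to add code to track the tags that feed the information hash, but that is similar to what's above. I'd also probably switch to case/when statements instead of lists of if statements to make setting and clearing the flags cleaner, but like I said, I'm tired, so I'm not going to bother right now.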

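Working on "real iron" rather than a virtual machine, you might be able to add enough RAM to handle loading a 7M+ line XML file. Without the whole file I can't guess how much RAM it would take in real life, but that is somewhat beside the point. SAX is designed to handle files of arbitrary size, because SAX processing really does break the whole XML document down into smaller chunks that are easier to deal with.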

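DOM is convenient for most things; a lot of the time we see XML that represents a single object, or a small extract from a database. I'm guessing you are dealing with a very large extract, or even a full database dump. In that case DOM isn't really the right tool, but SAX is.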

It's nice that Nokogiri is able to handle both.