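Grabbing the text value of XML nodes with Nokogiri and XPath

Date: 2013-12-26 18:07:13

Tags: ruby-on-rails ruby xml xpath nokogiri

I've built a rake task that inserts all the information I scrape into my database. That part works, but the values for my keys are not being populated with any data. Am I perhaps making my at_xpath calls incorrectly? I'll post an example below -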

information = {
            "street_address" => property.at_xpath("/Address/AddressLine1/text()"),
            "city" => property.at_xpath("/Address/City/text()"),
            "zipcode" => property.at_xpath("/Address/PostalCode/text()"),
            "short_description" => property.at_xpath("/Information/ShortDescription/text()"),
            "long_description" => property.at_xpath("Information/LongDescription/text()"),
            "rent" => property.at_xpath("/Information/Rents/StandardRent/text()"),
            "application_fee" => property.at_xpath("/Fee/ApplicationFee/text()"),
            "bedrooms" => property.at_xpath("/Floorplan/Room[@RoomType='Bedroom']/Count/text()"),
            "bathrooms" => property.at_xpath("/Floorplan/Room[@RoomType='Bathroom']/Count/text()"),
            "bathrooms" => property.at_xpath("/ILS_Unit/Availability/VacancyClass/text()")
        }

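Apart from getting actual values into the hash listed above, I know everything else works. I also know Nokogiri and XPath are working, because I narrowed the number of matched nodes from 33,000+ down to 1,068.

Any guidance would be greatly appreciated! Thanks :))

========================= UPDATE ============================

I think seeing the whole loop might help add clarity -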

doc.xpath("//Property/PropertyID/Identification[@OrganizationName='northsteppe']").each do |property|

        # GATHER EACH PROPERTY'S INFORMATION
        information = {
            "street_address" => property.at_xpath("/Address/AddressLine1/text()"),
            "city" => property.at_xpath("/Address/City/text()"),
            "zipcode" => property.at_xpath("/Address/PostalCode/text()"),
            "short_description" => property.at_xpath("/Information/ShortDescription/text()"),
            "long_description" => property.at_xpath("Information/LongDescription/text()"),
            "rent" => property.at_xpath("/Information/Rents/StandardRent/text()"),
            "application_fee" => property.at_xpath("/Fee/ApplicationFee/text()"),
            "bedrooms" => property.at_xpath("/Floorplan/Room[@RoomType='Bedroom']/Count/text()"),
            "bathrooms" => property.at_xpath("/Floorplan/Room[@RoomType='Bathroom']/Count/text()"),
            "bathrooms" => property.at_xpath("/ILS_Unit/Availability/VacancyClass/text()")
        }


        # CREATE NEW PROPERTY WITH INFORMATION HASH CREATED ABOVE
        if Property.create!(information)
            puts "yay!"
        else
            puts "oh no! this sucks!"
        end

    end # ENDS XPATH EACH LOOP

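============================ ANOTHER UPDATE ==========================

So I tried swapping the "/text()" at the end of each at_xpath path for "/inner_text()" and received the following error -

rake aborted! Invalid expression: /Address/AddressLine1/inner_text()

Then I tried switching the "at_xpath" calls to "at_css" calls and doing something like this -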

"street_address" => property.at_css(".AddressLine1").text

but received the following error -

rake aborted! undefined method `text' for nil:NilClass

============================= UPDATE SHOWING THE XML ===========================

<Property IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
  <PropertyID>
    <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/>
    <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>
    <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
    <WebSite>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</WebSite>
    <Address AddressType="property">
      <Description>Address of Available Listing</Description>
      <AddressLine1>1689 N 4th St </AddressLine1>
      <City>Columbus</City>
      <State>OH</State>
      <PostalCode>43201</PostalCode>
      <Country>US</Country>
    </Address>
    <Phone PhoneType="office">
      <PhoneNumber>(614) 299-4110</PhoneNumber>
    </Phone>
    <Email>northsteppe.nsr@gmail.com</Email>
  </PropertyID>
  <ILS_Identification ILS_IdentificationType="Apartment" RentalType="Market Rate">
    <Latitude>39.997694</Latitude>
    <Longitude>-82.99903</Longitude>
    <LastUpdate Month="11" Day="11" Year="2013"/>
  </ILS_Identification>
  <Information>
    <StructureType>Standard</StructureType>
    <UnitCount>1</UnitCount>
    <ShortDescription>Spacious House Central Campus OSU, available fall</ShortDescription>
    <LongDescription>One of our favorites! This great house is perfect for students or a single family. With huge living and sleeping rooms, there is plenty of space. The kitchen is totally modernized with new appliances, and the bathroom has been updated. Natural woodwork and brick accents are seen within the house, and the decorative mantles. Ceiling fans and mini-blinds are included, as well as a FREE stack washer and dryer. The front and side deck. On site parking available.</LongDescription>
    <Rents>
      <StandardRent>2000.00</StandardRent>
    </Rents>
    <PropertyAvailabilityURL>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</PropertyAvailabilityURL>
  </Information>
  <Fee>
    <ProrateType>Standard</ProrateType>
    <LateType>Standard</LateType>
    <LatePercent>0</LatePercent>
    <LateMinFee>0</LateMinFee>
    <LateFeePerDay>0</LateFeePerDay>
    <NonRefundableHoldFee>0</NonRefundableHoldFee>
    <AdminFee>0</AdminFee>
    <ApplicationFee>30.00</ApplicationFee>
    <BrokerFee>0</BrokerFee>
  </Fee>
  <Deposit DepositType="Security Deposit">
    <Amount AmountType="Actual">
      <ValueRange Exact="2000.00" Currency="USD"/>
    </Amount>
  </Deposit>
  <Policy>
    <Pet Allowed="false"/>
  </Policy>
  <Phase IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
    <Name/>
    <Description/>
    <UnitCount>1</UnitCount>
    <RentableUnits>1</RentableUnits>
    <TotalSquareFeet>0</TotalSquareFeet>
    <RentableSquareFeet>0</RentableSquareFeet>
  </Phase>
  <Building IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
    <Name/>
    <Description/>
    <UnitCount>1</UnitCount>
    <SquareFeet>0</SquareFeet>
  </Building>
  <Floorplan IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
    <Name/>
    <UnitCount>1</UnitCount>
    <Room RoomType="Bedroom">
      <Count>4</Count>
      <Comment/>
    </Room>
    <Room RoomType="Bathroom">
      <Count>1</Count>
      <Comment/>
    </Room>
    <SquareFeet Min="0" Max="0"/>
    <MarketRent Min="2000" Max="2000"/>
    <EffectiveRent Min="2000" Max="2000"/>
  </Floorplan>
  <ILS_Unit IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8">
    <Units>
      <Unit>
        <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="UL Portfolio"/>
        <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName>
        <UnitBedrooms>4</UnitBedrooms>
        <UnitBathrooms>1.0</UnitBathrooms>
        <MinSquareFeet>0</MinSquareFeet>
        <MaxSquareFeet>0</MaxSquareFeet>
        <SquareFootType>internal</SquareFootType>
        <UnitRent>2000.00</UnitRent>
        <MarketRent>2000.00</MarketRent>
        <Address AddressType="property">
          <AddressLine1>1689 N 4th St </AddressLine1>
          <City>Columbus</City>
          <PostalCode>43201</PostalCode>
          <Country>US</Country>
        </Address>
      </Unit>
    </Units>
    <Availability>
      <VacateDate Month="7" Day="23" Year="2014"/>
      <VacancyClass>Unoccupied</VacancyClass>
      <MadeReadyDate Month="7" Day="23" Year="2014"/>
    </Availability>
    <Amenity AmenityType="Other">
      <Description>All new stainless steel appliances!  Refinished hardwood floors</Description>
    </Amenity>
    <Amenity AmenityType="Other">
      <Description>Ceramic tile</Description>
    </Amenity>
    <Amenity AmenityType="Other">
      <Description>Ceiling fans</Description>
    </Amenity>
    <Amenity AmenityType="Other">
      <Description>Wrap-around porch</Description>
    </Amenity>
    <Amenity AmenityType="Dryer">
      <Description>Free Washer and Dryer</Description>
    </Amenity>
    <Amenity AmenityType="Washer">
      <Description>Free Washer and Dryer</Description>
    </Amenity>
    <Amenity AmenityType="Other">
      <Description>off-street parking available</Description>
    </Amenity>
  </ILS_Unit>
  <File Active="true" FileID="820982141">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg</Src>
    <Width>360</Width>
    <Height>300</Height>
    <Rank>1</Rank>
  </File>
  <File Active="true" FileID="820982145">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/84e1be40-96fd-4717-b75d-09b39231a762/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>2</Rank>
  </File>
  <File Active="true" FileID="820982149">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/cd419635-c37f-4676-a43e-c72671a2a748/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>3</Rank>
  </File>
  <File Active="true" FileID="820982152">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/6b68dbd5-2cde-477c-99d7-3ca33f03cce8/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>4</Rank>
  </File>
  <File Active="true" FileID="820982155">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/17b6c7c0-686c-4e46-865b-11d80744354a/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>5</Rank>
  </File>
  <File Active="true" FileID="820982157">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/3545ac8b-471f-404a-94b2-fcd00dd16e25/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>6</Rank>
  </File>
  <File Active="true" FileID="820982160">
    <FileType>Photo</FileType>
    <Description>Unit Photo</Description>
    <Name/>
    <Caption/>
    <Format>image/jpeg</Format>
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/02471172-2183-4bf1-a3d7-33415f902c1c/medium.jpg</Src>
    <Width>350</Width>
    <Height>265</Height>
    <Rank>7</Rank>
  </File>
</Property>

2 Answers:

Answer 0: (score: 1)

In your loop you do:

doc.xpath("//Property/PropertyID/Identification[@OrganizationName='northsteppe']").each do |property|

Then, for your values, you do things like:

property.at_xpath("/Address/AddressLine1/text()")

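You can't use /Address/AddressLine1/text() as an XPath relative to property.

Nokogiri will search for /Address/AddressLine1/text() as an absolute path: it starts at the top of the document (/), looks for an Address node immediately below it, then an AddressLine1 node below that, and so on.

Use this instead: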

Address/AddressLine1/text()

That makes the search relative to property, which produces the full XPath:

//Property/PropertyID/Identification[@OrganizationName='northsteppe']/Address/AddressLine1/text()

Looking at the XML you added...

The path you want doesn't exist. Looking at it in Pry:

[16] (pry) main: 0> puts doc.xpath("//Property/PropertyID/Identification[@OrganizationName='northsteppe']").to_xml
<Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/><Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/>

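Neither of the matched property nodes has any child nodes. Only the nodes' attributes exist, so none of the values you are looking for (which live in child nodes) can be found there.

Instead, it looks like you want to find the Property node and work down from there.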

Answer 1: (score: 1)

Your first XPath goes too deep. It returns the Identification nodes, where you need the PropertyID. Try this:

doc.xpath("//Property/PropertyID[ Identification/@OrganizationName = 'northsteppe' ]").each do |property|
    # GATHER EACH PROPERTY'S INFORMATION
    information = {
        "street_address" => property.at_xpath("Address/AddressLine1/text()").to_s,
        "city" => property.at_xpath("Address/City/text()").to_s,
        "zipcode" => property.at_xpath("Address/PostalCode/text()").to_s
    }
    p information
end