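Parsing XML into CSV with Nokogiri

Time: 2014-04-07 17:54:30

Tags: ruby xml csv nokogiri

I'm trying to figure out how to get the Make and Model from the XML returned by a URL and put them into a CSV. Here is the XML the URL returns: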

<VINResult xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://basicvalues.pentondata.com/">
  <Vehicles>
    <Vehicle>
      <ID>131497</ID>
      <Product>TRUCK</Product>
      <Year>1993</Year>
      <Make>Freightliner</Make>
      <Model>FLD12064T</Model>
      <Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
    </Vehicle>
    <Vehicle>
      <ID>131497</ID>
      <Product>TRUCK</Product>
      <Year>1993</Year>
      <Make>Freightliner</Make>
      <Model>FLD12064T</Model>
      <Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
    </Vehicle>
  </Vehicles>
  <Errors/>
  <InvalidVINMsg/>
</VINResult>

Here is the code I have so far:

require 'csv'
require 'rubygems'
require 'nokogiri'
require 'open-uri'

    vincarriercsv = 'vincarrier.csv'
    vindetails = 'vindetails.csv'
    vinurl =  'http://redacted/LookUp_VIN?key=redacted&vin='

    CSV.open(vindetails, "wb") do |details|
        CSV.foreach(vincarriercsv) do |row|
            vinxml = Nokogiri::HTML(vinurl + row[1])
                make = vinxml.xpath('//VINResult//Vehicles//Vehicle//Make').text
                model = vinxml.xpath('//VINResult//Vehicles//Vehicle//Model').text
            details << [ row[0], row[1], make, model ]
        end
    end

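For some reason the URL returns the same data twice, but I only need the first result. So far my attempts to get the Make and Model out of the XML have failed... any ideas?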

2 answers:

Answer 0: (score: 1)

Here's how to get the make and model data. Converting it to CSV is left to you:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<VINResult xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://basicvalues.pentondata.com/">
  <Vehicles>
    <Vehicle>
      <ID>131497</ID>
      <Product>TRUCK</Product>
      <Year>1993</Year>
      <Make>Freightliner</Make>
      <Model>FLD12064T</Model>
      <Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
    </Vehicle>
    <Vehicle>
      <ID>131497</ID>
      <Product>TRUCK</Product>
      <Year>1993</Year>
      <Make>Freightliner</Make>
      <Model>FLD12064T</Model>
      <Description>120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes & Power Steering 6x4 (SBA - Set Back Axle)</Description>
    </Vehicle>
  </Vehicles>
  <Errors/>
  <InvalidVINMsg/>
</VINResult>
EOT

vehicle_make_and_models = doc.search('Vehicle').map{ |vehicle|
    [
      'make', vehicle.at('Make').content,
      'model', vehicle.at('Model').content
    ]
  }

This results in:

vehicle_make_and_models # => [["make", "Freightliner", "model", "FLD12064T"], ["make", "Freightliner", "model", "FLD12064T"]]

If you don't want the field names:

vehicle_make_and_models = doc.search('Vehicle').map{ |vehicle|
  [
    vehicle.at('Make').content,
    vehicle.at('Model').content
  ]
}

vehicle_make_and_models # => [["Freightliner", "FLD12064T"], ["Freightliner", "FLD12064T"]]

Note: you have XML, not HTML. Don't assume Nokogiri treats them the same, or that the differences are trivial. Nokogiri parses XML strictly, because XML is a strict standard.

I use CSS selectors unless I absolutely have to use XPath. CSS results in clearer selectors most of the time, which makes the code easier to read.

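vinxml.xpath('//VINResult//Vehicles//Vehicle//Make').text doesn't do what you want. // means "match at any depth": the leading // searches from the top of the document, and each later // searches all descendants of the nodes matched so far, so the expression finds every matching node. xpath returns all matching nodes as a NodeSet, not a single node, and text returns the text of every node in the NodeSet joined into one string, which is probably not what you want.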

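I prefer search over xpath or css. It returns a NodeSet like the other two, but it accepts either a CSS or an XPath selector. If a particular selector is ambiguous and could be read as either CSS or XPath, you can fall back to the explicit form. Similarly, you can use at, at_xpath or at_css to find just the first matching node, which is equivalent to search('foo').first.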

Answer 1: (score: 0)

You could also do the following, which puts all of the vehicles into an Array, with each vehicle's attributes in a Hash:

require 'nokogiri'

# YOUR_XML_FILE is the path to the saved XML response.
doc = Nokogiri::XML(File.open(YOUR_XML_FILE))
vehicles = doc.search("Vehicle").map do |vehicle|
  Hash[
    vehicle.children.map do |child|
      # Skip the whitespace-only text nodes between elements.
      [child.name, child.text] unless child.text.strip.empty?
    end.compact
  ]
end
#=>[{"ID"=>"131497", "Product"=>"TRUCK", "Year"=>"1993", "Make"=>"Freightliner", "Model"=>"FLD12064T", "Description"=>"120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes  Power Steering 6x4 (SBA - Set Back Axle)"}, {"ID"=>"131497", "Product"=>"TRUCK", "Year"=>"1993", "Make"=>"Freightliner", "Model"=>"FLD12064T", "Description"=>"120'' BBC Alum Air Cond Long Conv. (SBA) Tractor w/48'' Sleeper Air Brakes  Power Steering 6x4 (SBA - Set Back Axle)"}]

You can then access all of the attributes of a single vehicle, e.g.

vehicles.first["ID"]
#=> "131497"
vehicles.first["Year"]
#=> "1993"
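
To get from the array of hashes to the CSV the question asks for, something like the following might work. It is only a sketch: the vehicles array is typed in by hand to mirror the output above (Description left out for brevity), the column selection is my own choice, and uniq is used on the assumption that the duplicated entries are identical:

```ruby
require 'csv'

# Hand-typed stand-in for the vehicles array built above.
vehicles = [
  { 'ID' => '131497', 'Product' => 'TRUCK', 'Year' => '1993', 'Make' => 'Freightliner', 'Model' => 'FLD12064T' },
  { 'ID' => '131497', 'Product' => 'TRUCK', 'Year' => '1993', 'Make' => 'Freightliner', 'Model' => 'FLD12064T' }
]

# Columns to export, in output order (an arbitrary choice for this example).
columns = %w[ID Year Make Model]

csv_text = CSV.generate do |csv|
  csv << columns                          # header row
  vehicles.uniq.each do |vehicle|         # uniq drops the identical duplicate
    csv << vehicle.values_at(*columns)    # pick the values in column order
  end
end

puts csv_text
```

If only the first vehicle is ever wanted, vehicles.first would do the same job as uniq here without comparing entries.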