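How to parse XML with Ruby and Nokogiri

Date: 2016-10-13 13:18:34

Tags: ruby xml nokogiri

This document is the output of a firewall configuration. I am trying to build a hash of the firewall rules. Later I will output this data to CSV, the console, or whatever else I need: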

<table index="44" title=" from PUBLIC to DMZ administrative service rules on Firewall01" ref="FILTER.BLACKLIST">
  <headings>
    <heading>Rule</heading>
    <heading>Action</heading>
    <heading>Source</heading>
    <heading>Destination</heading>
    <heading>Service</heading>
    <heading>Log</heading>
  </headings>
  <tablebody>
    <tablerow>
      <tablecell><item>test_inbound</item></tablecell>
      <tablecell><item>Allow</item></tablecell>
      <tablecell><item gotoref="CONFIG.3.452">[Group] test_b2_group</item></tablecell>
      <tablecell><item>[Host] Any</item></tablecell>
      <tablecell><item>[Host] Any</item></tablecell>
      <tablecell><item>Yes</item></tablecell>
    </tablerow>
    <tablerow>
      <tablecell><item>host02_inbound</item></tablecell>
      <tablecell><item>Allow</item></tablecell>
      <tablecell><item gotoref="CONFIG.3.447">[Group] host02_group</item></tablecell>
      <tablecell><item>[Host] Any</item></tablecell>
      <tablecell><item>[Host] Any</item></tablecell>
      <tablecell><item>Yes</item></tablecell>
    </tablerow>
    <tablerow>
      <tablecell><item>randomhost</item></tablecell>
      <tablecell><item>Allow</item></tablecell>
      **<tablecell><item gotoref="CONFIG.3.383">[Group] Host_group_2</item><item gotoref="CONFIG.3.382">[Group] another_server</item></tablecell>**
      <tablecell><item gotoref="CONFIG.3.510">[Group] crazy_application</item><item gotoref="CONFIG.3.511">[Group] internal_app</item><item gotoref="CONFIG.3.525">[Group] online_application</item></tablecell>
      <tablecell><item gotoref="CONFIG.3.783">[Group] junos-https</item></tablecell>
      <tablecell><item>No</item></tablecell>
    </tablerow>
  </tablebody>
</table>

The table has column headings and three firewall rules.

Here is my code:

#!/usr/bin/env ruby

require 'nokogiri'
require 'csv'

fwpol = File.open(ARGV[0]) { |f| Nokogiri::XML(f) }
rule_array = []

fwpol.xpath('./table/tablebody/tablerow').each do |item|
  rules = {}

  rules[:name]   = item.xpath('./tablecell/item')[0].text
  rules[:action] = item.xpath('./tablecell/item')[1].text
  rules[:source] = item.xpath('./tablecell/item')[2].text
  rule_array << rules
end

puts rule_array

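The first two hash entries, :name and :action, work fine because those fields hold only a single value.

When I run the code, the values are not all printed where a cell holds more than one. The XML line in bold shows what I mean. I need to iterate over those values somehow, but none of my attempts so far have worked.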

2 Answers:

Answer 0: (score: 2)

You can get the text of multiple elements as an array like this.

require 'nokogiri'
require 'csv'

fwpol = File.open(ARGV[0]) { |f| Nokogiri::XML(f) }
rule_array = []

fwpol.xpath('./table/tablebody/tablerow').each do |item|
  rules = {}

  rules[:name]   = item.xpath('./tablecell[1]/item').text
  rules[:action] = item.xpath('./tablecell[2]/item').text
  rules[:source] = item.xpath('./tablecell[3]/item').map(&:text)
  rule_array << rules
end

puts rule_array

Here is the output.

{:name=>"test_inbound", :action=>"Allow", :source=>["[Group] test_b2_group"]}
{:name=>"host02_inbound", :action=>"Allow", :source=>["[Group] host02_group"]}
{:name=>"randomhost", :action=>"Allow", :source=>["[Group] Host_group_2", "[Group] another_server"]}

Answer 1: (score: 1)

I would do something like this:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<table index="44" title=" from PUBLIC to DMZ administrative service rules on Firewall01" ref="FILTER.BLACKLIST">
  <tablebody>
    <tablerow>
      <tablecell><item>test_inbound</item></tablecell>
      <tablecell><item>Allow</item></tablecell>
      <tablecell><item gotoref="CONFIG.3.452">[Group] test_b2_group</item></tablecell>
      <tablecell><item>[Host] Any</item></tablecell>
      <tablecell><item>[Host] Any</item></tablecell>
      <tablecell><item>Yes</item></tablecell>
    </tablerow>
    <tablerow>
      <tablecell><item>randomhost</item></tablecell>
      <tablecell><item>Allow</item></tablecell>
      <tablecell><item gotoref="CONFIG.3.383">[Group] Host_group_2</item><item gotoref="CONFIG.3.382">[Group] another_server</item></tablecell>
      <tablecell><item gotoref="CONFIG.3.510">[Group] crazy_application</item><item gotoref="CONFIG.3.511">[Group] internal_app</item><item gotoref="CONFIG.3.525">[Group] online_application</item></tablecell>
      <tablecell><item gotoref="CONFIG.3.783">[Group] junos-https</item></tablecell>
      <tablecell><item>No</item></tablecell>
    </tablerow>
  </tablebody>
</table>
EOT

rule_array = doc.search('tablerow').map{ |row|
  name, action, source = row.search('tablecell')[0, 3].map{ |tc| tc.search('item').map(&:text) }

  {
    name: name,
    action: action,
    source: source
  }
}

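When run, this returns rule_array, an array of hashes. Every value is an array of item texts, and the source of the last hash contains two item entries: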

require 'ap'
ap rule_array

# >> [
# >>   [0] {
# >>     :name   => [
# >>       [0] "test_inbound"
# >>     ],
# >>     :action => [
# >>       [0] "Allow"
# >>     ],
# >>     :source => [
# >>       [0] "[Group] test_b2_group"
# >>     ]
# >>   },
# >>   [1] {
# >>     :name   => [
# >>       [0] "randomhost"
# >>     ],
# >>     :action => [
# >>       [0] "Allow"
# >>     ],
# >>     :source => [
# >>       [0] "[Group] Host_group_2",
# >>       [1] "[Group] another_server"
# >>     ]
# >>   }
# >> ]

Note: don't do this:

fwpol = File.open(ARGV[0]) { |f| Nokogiri::XML(f) }

It is simpler to use:

fwpol = Nokogiri::XML(File.read(ARGV[0]))

And instead of:

item.xpath('./tablecell/item')[0].text
item.xpath('./tablecell/item')[1].text
item.xpath('./tablecell/item')[2].text

Just find the tablecell tags once, slice out the ones you need with [0, 3], and iterate over that small group. It is faster and reduces code duplication.

See also "How to avoid joining all text from Nodes when scraping".