Parsing a list into a multidimensional array without a loop

Time: 2012-09-07 20:58:48

Tags: ruby parsing nokogiri

I'm using Ruby and Nokogiri to parse HTML source, and I've reduced it to a list of items that follow a recognizable pattern, in this format:

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

and so on, many times over.

How can I build a multidimensional array holding the required parameters, using the following structure?

myarray = []
mystuff = Struct.new(:ParameterA, :ParameterB, :ParameterC)

I can't figure out what kind of loop to run here, or how to avoid parsing the useless parts.

2 answers:

Answer 0: (score: 1)

I was able to solve this with a regular expression, which gives me the correct multidimensional array as output:

[["ParameterA", "ParameterB", "Possible ParameterC"], ["ParameterA", "ParameterB", "Possible ParameterC"]]

Working code:

str = <<EOF
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
EOF

m = str.scan(/<small [^>]+>([^<]+)<.*?<b>([^<]+)<\/b>\s+<i>([^<]+)<\/i>/m)
puts m.inspect
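
One caveat: the pattern above requires an <i> element in every item, while the question describes ParameterC as only "possible". A minimal sketch of one way to relax it is to make the <i> part an optional group. The sample input and the names html, pattern, and rows are illustrative assumptions, not part of the original post:

```ruby
# Hypothetical sample input: the second item has no <i> element.
html = <<~HTML
  <small class="y">A1</small>
  <span class="z">
      <b>B1</b>
      <i>C1</i>
  </span>
  <small class="y">A2</small>
  <span class="z">
      <b>B2</b>
  </span>
HTML

# Same idea as the pattern above, but the <i>...</i> part is wrapped in an
# optional non-capturing group, so its capture is nil when <i> is missing.
pattern = /<small [^>]+>([^<]+)<.*?<b>([^<]+)<\/b>(?:\s*<i>([^<]+)<\/i>)?/m

rows = html.scan(pattern)
p rows
```

For an item without <i>, the third element of its row comes back as nil. Like any regular expression run against HTML, this remains sensitive to changes in the markup.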

Answer 1: (score: 0)

I'd use something like this:

require 'nokogiri'
require 'pp' # needed for pp on Ruby versions before 2.5

doc = Nokogiri::HTML(<<EOT)
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>

<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
    <b>ParameterB</b>
    <i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
EOT

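# Each <small class="y"> starts one item; the <span class="z"> right after it holds the rest.
mystuff = doc.search('small.y').map { |small_y|
  span_z = small_y.next_element
  [
    small_y.content,
    span_z.at('b').content,
    span_z.at('i') ? span_z.at('i').content : nil # <i> may be absent
  ]
}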

pp mystuff

The output looks like:

[
  [
    "ParameterA",
    "ParameterB",
    "Possible ParameterC"
  ],
  [
    "ParameterA",
    "ParameterB",
    "Possible ParameterC"
  ]
]
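
The question asked for the values to end up in a Struct. Rows shaped like the output above can be converted in one pass with map. This is only a sketch: the constant name Item and the lowercase member names are assumptions chosen for illustration (the question used :ParameterA, :ParameterB, :ParameterC), and the rows literal stands in for whatever the parsing step returns:

```ruby
# Hypothetical Struct mirroring the one in the question, with lowercase
# member names so the accessors read like ordinary Ruby methods.
Item = Struct.new(:parameter_a, :parameter_b, :parameter_c)

# Stand-in for the parsed rows; the second row has no ParameterC.
rows = [
  ["ParameterA", "ParameterB", "Possible ParameterC"],
  ["ParameterA", "ParameterB", nil]
]

# Splat each three-element row into the Struct's constructor.
items = rows.map { |row| Item.new(*row) }

items.each do |item|
  puts "#{item.parameter_a}, #{item.parameter_b}, #{item.parameter_c.inspect}"
end
```

Each element of items then responds to parameter_a, parameter_b, and parameter_c, with parameter_c being nil when the item had no third value.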