I'm using Ruby and Nokogiri to parse HTML source in which the items are listed in a recognizable pattern, in the following format:
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
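and so on, repeated many times.
How can I build a multidimensional array holding the parameters I need, using the following structure?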
myarray = []
mystuff = Struct.new(:ParameterA, :ParameterB, :ParameterC)
I can't figure out what kind of loop to run here, or how to avoid parsing the useless stuff.
Answer 0 (score: 1)
I was able to solve this with a regexp, which gives me the correct multidimensional array as output:
[["ParameterA", "ParameterB", "Possible ParameterC"], ["ParameterA", "ParameterB", "Possible ParameterC"]]
Working code:
str = <<EOF
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
EOF
m = str.scan(/<small [^>]+>([^<]+)<.*?<b>([^<]+)<\/b>\s+<i>([^<]+)<\/i>/m)
puts m.inspect
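The regexp above requires an <i> element in every record, but the sample labels it "Possible ParameterC", so it may sometimes be absent. Here is a minimal sketch of a variant that treats it as optional. The helper name extract_records and the sample with a missing <i> are my own assumptions, not something from the question:

```ruby
# Sketch only: extract_records is a hypothetical helper name.
def extract_records(html)
  # Anchor on <span class="z"> directly after the <small> element so a match
  # cannot run into the next record, and put <i> in an optional group.
  pattern = %r{<small [^>]+>([^<]+)</small>\s*<span class="z">\s*<b>([^<]+)</b>(?:\s*<i>([^<]+)</i>)?}
  html.scan(pattern)
end

# Assumed sample: the second record has no <i> element.
sample = <<~HTML
  <span class="x">junk</span>
  <small class="y">ParameterA</small>
  <span class="z">
  <b>ParameterB</b>
  <i>Possible ParameterC</i>
  </span>
  <script type="text/javascript">useless stuff</script>
  <span class="x">junk</span>
  <small class="y">ParameterA2</small>
  <span class="z">
  <b>ParameterB2</b>
  </span>
HTML

p extract_records(sample)
# => [["ParameterA", "ParameterB", "Possible ParameterC"], ["ParameterA2", "ParameterB2", nil]]
```

An unmatched optional group comes back from scan as nil, so every row still has three slots. This stays as fragile as any regexp run against HTML: it assumes the tags appear exactly in this order and without extra attributes.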
Answer 1 (score: 0)
I'd use something like this:
require 'nokogiri'
require 'pp' # only needed on Ruby older than 2.5, where pp is not loaded by default
doc = Nokogiri::HTML(<<EOT)
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
<span class="x">junk</span>
<small class="y">ParameterA</small>
<span class="z">
<b>ParameterB</b>
<i>Possible ParameterC</i>
</span>
<script type="text/javascript">useless stuff</script>
<object><noscript>other useless stuff</noscript></object>
EOT
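mystuff = doc.search('small.y').map { |small|
  # The <span class="z"> that follows each <small class="y"> holds the other two values.
  span_z = small.next_element
  italic = span_z.at('i') # may be missing, hence the nil fallback
  [
    small.content,
    span_z.at('b').content,
    italic ? italic.content : nil
  ]
}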
pp mystuff
Which gives a result like:
[
[
"ParameterA",
"ParameterB",
"Possible ParameterC"
],
[
"ParameterA",
"ParameterB",
"Possible ParameterC"
]
]
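The question asked for Struct instances rather than plain arrays, so here is a rough sketch of that last step. MyStuff, to_structs and the snake_case member names are assumptions of mine (the question uses :ParameterA and so on), and the rows literal only stands in for the array shown above:

```ruby
# Hypothetical names: MyStuff and its members are not from the question.
MyStuff = Struct.new(:parameter_a, :parameter_b, :parameter_c)

# Turn each three-element row into a MyStuff instance.
def to_structs(rows)
  rows.map { |row| MyStuff.new(*row) }
end

# Stand-in for the array of arrays produced by the parsing step.
rows = [
  ["ParameterA", "ParameterB", "Possible ParameterC"],
  ["ParameterA", "ParameterB", nil]
]

myarray = to_structs(rows)
myarray.each do |item|
  puts "#{item.parameter_a} / #{item.parameter_b} / #{item.parameter_c.inspect}"
end
```

With this you can read item.parameter_c instead of remembering that index 2 is the optional value, and item.to_a gives the original array back if you still need it.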