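I have this code that goes to a URL and parses the 'li' elements into an array. However, I have run into a problem when trying to parse anything that is not inside a "b" tag.
Code: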
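require 'nokogiri'
require 'open-uri'
require 'csv'
require 'pp'

url = '(some URL)'
page = Nokogiri::HTML(URI.open(url))
csv = CSV.open("/tmp/output.csv", 'w')
page.search('//li[not(@id) and not(@class)]').each do |row|
  arr = []
  # Only the text of the <b> children is collected here
  row.search('b').each do |cell|
    arr << cell.text
  end
  csv << arr
  pp arr
end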
HTML:
<li><b>The Company Name</b><br>
The Street<br>
The City,
The State
The Zipcode<br><br>
</li>
I want to parse all of the elements so that the output looks like this:
["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"]
Answer 0 (score: 1)
require 'nokogiri'

def main
  output = []
  page = File.open("parse.html") { |f| Nokogiri::HTML(f) }
  page.search("//li[not(@id) and not(@class)]").each do |row|
    arr = []
    # Take the full text of the <li> and keep every non-blank line
    result = row.text
    result.each_line do |l|
      arr << l.strip if l.strip.length > 0
    end
    output << arr
  end
  print output
end

if __FILE__ == $PROGRAM_NAME
  main()
end
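The line-splitting step does not depend on Nokogiri, so it can be tried on a plain string. Here is a minimal sketch, assuming `row.text` returns roughly the text below for the sample `<li>`. The string is hand-written rather than captured parser output, and `clean_lines` is a hypothetical helper name. It also drops the trailing comma after the city, which the code in this answer keeps:

```ruby
# Hypothetical helper: split a block of text into non-empty, stripped lines.
def clean_lines(text)
  text.each_line
      .map { |line| line.strip.chomp(',') } # trim whitespace, then a trailing comma
      .reject(&:empty?)                     # skip blank lines
end

# Hand-written stand-in for what row.text might return for the sample <li>.
sample = "The Company Name\nThe Street\nThe City,\nThe State\nThe Zipcode\n"

p clean_lines(sample)
```

Each returned array can then be appended to `output` (or written as a CSV row) exactly as before.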
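Answer 1 (score: 0)
I ended up finding a solution to my own problem, so in case anyone is interested, I simply changed
row.search('b').each do |cell|
to:
row.search('text()').each do |cell|
I also changed
arr << cell.text
to:
arr << cell.text.gsub("\n", '').gsub("\r", '')
to remove all of the \n and \r characters present in the output.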
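One thing to watch with this change: in the sample HTML the city, state, and zip code appear to sit in a single text node separated only by newlines, so deleting "\n" would glue them together. Here is a minimal sketch on plain strings that splits on line breaks instead. It assumes the text nodes look roughly like the hand-written `nodes` array below, which is based on the HTML in the question and is not captured parser output:

```ruby
# Hand-written stand-ins for the text nodes of the sample <li> (assumed, not parser output).
nodes = ["The Company Name", "\nThe Street", "\nThe City,\nThe State\nThe Zipcode", "\n"]

# Deleting line breaks merges the last three fields into one string.
joined = nodes.map { |t| t.gsub("\n", '').gsub("\r", '') }

# Splitting on line breaks keeps one field per line and drops the blanks.
fields = nodes.flat_map { |t| t.split(/\r?\n/) }
              .map { |s| s.strip.chomp(',') }
              .reject(&:empty?)

p joined
p fields
```

If the real text nodes differ from this assumption (for example, if the page uses different whitespace), the split pattern may need adjusting.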
Answer 2 (score: 0)
Based on your HTML, I would do it like this:
{{1}}