Counting HTML tags the Ruby way (inject, blocks, each...)

Time: 2014-11-06 12:10:18

Tags: ruby

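I want to count how many times several HTML tags occur in a page. I can do it the classic way, but I would like to do it the Ruby way.

This is what I tried, but instead of adding up the per-tag counts, it produces a string made of the list elements: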

tags = [ '<img>', '<script>', '<applet>', '<video>', '<audio>' ]
weight = tags.each { |tag| web.to_s.scan(/#{tag}/).length }.inject(:+)

Any hints?

Edit

def browse startpage, depth, block
    if depth > 0
        begin 
            web = open(startpage).read
            block.call startpage, web
        rescue
            return
        end
        links = URI.extract(web)
        links.each { |link| browse link, depth-1, block } 
    end
end

browse("https://www.youtube.com/", 2, lambda { |page_name, web|
    tags = [ '<img>', '<script>', '<applet>', '<video>', '<audio>' ]
    web.force_encoding 'utf-8'
    parsed_string = Nokogiri::HTML(web)
    weight = tags.each_with_object(Hash.new(0)) do |tag, hash|
      occurrences = parsed_string.xpath("//#{tag.gsub(/[<>]/, '')}").length
      hash[tag] = occurrences
    end
    puts "Page weight for #{web.base_uri} = #{weight}"
})

2 answers:

Answer 0 (score: 0)

Here is one way to solve the problem:

web = "<audio> <audio> <video>" # I guess 'web' is other than a string in your example, so the need for to_s below
tags = [ '<img>', '<script>', '<applet>', '<video>', '<audio>' ]

tag_occurrences = tags.each_with_object(Hash.new(0)) do |tag, hash|
  occurrences = web.to_s.scan(/#{tag}/).length
  hash[tag] = occurrences
end

p tag_occurrences #=> {"<img>"=>0, "<script>"=>0, "<applet>"=>0, "<video>"=>1, "<audio>"=>2}
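
As for why the one-liner in the question ended up with a string: `each` returns the array it was called on, not the block results, so `inject(:+)` concatenates the tag strings. If a single number is what you are after, a minimal sketch that swaps `each` for `map` could look like this (the sample `web` string is made up for illustration):

```ruby
# Hypothetical sample input; in the question, `web` holds the fetched page.
web = "<audio> <audio> <video>"
tags = ['<img>', '<script>', '<applet>', '<video>', '<audio>']

# map collects each block result (the per-tag count) into a new array,
# whereas each would simply return `tags` itself.
counts = tags.map { |tag| web.to_s.scan(/#{tag}/).length }

# Start the sum from 0 so an empty tag list still yields a number.
weight = counts.inject(0, :+)

p counts
p weight
```

This keeps the literal-tag regex from the question, so it shares that approach's limitations.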

Using regular expressions to match tags is not recommended, though. A better approach is to use something like Nokogiri to count the tags:

require 'nokogiri'
web = "<audio> <audio> <video>" 
parsed_string = Nokogiri::HTML(web.to_s) #using to_s because I'm assuming web isn't an actual string in your code
tags = [ '<img>', '<script>', '<applet>', '<video>', '<audio>' ]

tag_occurrences = tags.each_with_object(Hash.new(0)) do |tag, hash|
  occurrences = parsed_string.xpath("//#{tag.gsub(/[<>]/, '')}").length
  hash[tag] = occurrences
end

p tag_occurrences #=> {"<img>"=>0, "<script>"=>0, "<applet>"=>0, "<video>"=>1, "<audio>"=>2}
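
To illustrate one reason the regex version is fragile: a literal pattern such as `<img>` only matches the bare tag, while real pages usually put attributes on it. A small sketch, using an invented `html` string and plain Ruby only:

```ruby
# Hypothetical markup: tags with attributes and mixed case.
html = '<img src="a.png"> <IMG src="b.png"> <img>'

# The literal pattern only matches the bare, lowercase tag.
literal_count = html.scan(/<img>/).length

# A looser pattern (tag name followed by a word boundary, case-insensitive)
# catches more, but can still be fooled by comments or script contents.
loose_count = html.scan(/<img\b/i).length

p literal_count
p loose_count
```

A parser sidesteps these cases because it counts element nodes rather than matching text.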

Regarding your comment: I ran this against YouTube (processing the data with my second snippet) and got:

require 'open-uri'
web = URI.open('http://youtube.com').read # URI.open (Ruby 2.5+); plain open no longer accepts URLs as of Ruby 3.0
# the code above to parse web using Nokogiri
p tag_occurrences #=> {"<img>"=>151, "<script>"=>13, "<applet>"=>0, "<video>"=>0, "<audio>"=>0}

Answer 1 (score: 0)

I would traverse the document once, counting the node names:

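require 'nokogiri'
require 'open-uri'

# URI.open (Ruby 2.5+); plain open no longer accepts URLs as of Ruby 3.0
doc = Nokogiri::HTML(URI.open('https://www.youtube.com/'))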
tags_count = Hash.new(0)
doc.traverse { |node| tags_count[node.name] += 1 }
tags_count
#=> {"html"=>2, "#cdata-section"=>12, "script"=>15, "text"=>7958, "link"=>11, "title"=>1, "meta"=>4, "comment"=>18, "head"=>1, "div"=>1152, "input"=>2, "form"=>2, "img"=>135, "span"=>2878, "a"=>397, "button"=>434, "label"=>1, "li"=>740, "ul"=>265, "hr"=>3, "h3"=>117, "p"=>48, "br"=>3, "strong"=>2, "ol"=>1, "h2"=>26, "b"=>5, "body"=>1, "document"=>1}
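
From a tally like that, the counts for just the tags in the question can be picked out and summed. A rough sketch, assuming a small hand-written excerpt of the hash above (the `wanted`, `per_tag`, and `weight` names are only illustrative):

```ruby
# Hypothetical excerpt of the tallies above; the real hash comes from doc.traverse.
tags_count = Hash.new(0)
tags_count.update("script" => 15, "img" => 135, "div" => 1152)

wanted = %w[img script applet video audio]

# Hash.new(0) returns 0 for tags that never appeared on the page.
per_tag = wanted.map { |name| [name, tags_count[name]] }.to_h
weight = per_tag.values.inject(0, :+)

p per_tag
p weight
```

Note that the keys here are bare node names (`img`), not the bracketed form (`<img>`) used in the question.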