Question

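I'm trying to build a histogram of the letters (a, b, c, etc.) on a specified web page. I plan to use a hash for the histogram itself. However, I'm having some trouble actually getting the HTML.

My current code: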

#!/usr/local/bin/ruby


require 'net/http'
require 'open-uri'


# This will be the hash used to store the
# histogram.
histogram = Hash.new(0)

def open(url)
    Net::HTTP.get(URI.parse(url))
end

page_content = open('_insert_webpage_here')

page_content.each_line do |line|
    puts line
end

This fetches the HTML just fine. However, it gets everything. For www.stackoverflow.com, it gives me:

<body><h1>Object Moved</h1>This document may be found <a HREF="http://stackoverflow.com/">here</a></body>

Pretending that this was the right page, I don't want the HTML tags. I just want to get "Object Moved" and "This document may be found here".

Is there a reasonably simple way to do this?

Answer 1

When you require 'open-uri', you don't need to redefine open with Net::HTTP.

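require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
page_content = URI.open('http://www.stackoverflow.com').read

# Default each count to 0 so new characters need no special-casing.
histogram = Hash.new(0)
page_content.each_char do |c|
  histogram[c] += 1
end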

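Note: this does not strip the <tags> from the HTML document, so <html><body>x!</body></html> would give { '<' => 4, 'h' => 2, 't' => 2, ... } instead of { 'x' => 1, '!' => 1 }. To remove the tags, you can use Nokogiri (which you said is not available) or some kind of regular expression (such as the one in Dru's answer).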
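To make the note above concrete, here is a minimal sketch that strips tags with a regular expression before counting. The helper name `letter_histogram` is hypothetical, and restricting the count to the letters a-z (case-insensitive) is an assumption based on the question's wording. It works on a string, so it can be tried without a network connection.

```ruby
# Hypothetical helper: counts the letters a-z in the visible text of an HTML string.
# The tag-stripping regex is naive and will misbehave on comments, scripts,
# or attribute values that contain '>'.
def letter_histogram(html)
  text = html.gsub(/<\/?[^>]*>/, '')   # drop anything that looks like a tag
  histogram = Hash.new(0)              # missing keys start at 0
  text.downcase.scan(/[a-z]/) { |letter| histogram[letter] += 1 }
  histogram
end

# With the example from the note, only 'x' is counted: the tags are gone
# and '!' is not a letter.
p letter_histogram('<html><body>x!</body></html>')
```

Passing the fetched page_content instead of the literal string should give the histogram for a real page, subject to the regex caveats in the comments.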

Answer 2

See the "Following Redirection" section of the Net::HTTP documentation.
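The "Object Moved" body in the question is a redirect response, which Net::HTTP.get does not follow on its own. A rough sketch of the recursive approach follows, modelled on the pattern in the Net::HTTP documentation. The method name `fetch` and the limit of 10 redirects are illustrative choices, and resolving the Location header with URI.join is an assumption added here to cope with relative redirects.

```ruby
require 'net/http'
require 'uri'

# Fetches uri_str, following redirects up to `limit` times.
# Returns the final Net::HTTPResponse, or raises on errors.
def fetch(uri_str, limit = 10)
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  response = Net::HTTP.get_response(URI(uri_str))

  case response
  when Net::HTTPSuccess
    response
  when Net::HTTPRedirection
    # Resolve the Location header against the current URL in case it is relative.
    location = URI.join(uri_str, response['location']).to_s
    fetch(location, limit - 1)
  else
    response.value # raises an exception for non-2xx responses
  end
end

# Usage (requires network access):
#   page_content = fetch('http://www.stackoverflow.com').body
```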

Answer 3

Stripping HTML tags without Nokogiri

puts page_content.gsub(/<\/?[^>]*>/, "")

http://codesnippets.joyent.com/posts/show/615
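As a quick check, the regex above can be applied to the response body quoted in the question. This sketch also shows one side effect: text from adjacent elements gets glued together when tags are replaced with nothing. Substituting a space instead is a variation suggested here, not part of the original snippet.

```ruby
page_content = '<body><h1>Object Moved</h1>This document may be found ' \
               '<a HREF="http://stackoverflow.com/">here</a></body>'

# Original approach: replace each tag with nothing.
stripped = page_content.gsub(/<\/?[^>]*>/, '')

# Variation: replace each tag with a space, then collapse runs of spaces.
spaced = page_content.gsub(/<\/?[^>]*>/, ' ').squeeze(' ').strip

puts stripped # "Moved" and "This" end up joined
puts spaced   # words from neighbouring elements stay separated
```

For a letter histogram the difference does not matter, but it does if the stripped text is later split into words.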

Downloading HTML text with Ruby
