Question

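I'm frustrated trying to fetch the contents of a specific URL with Ruby.

I've tried several approaches, such as open-uri and standard Net::HTTP requests, and none has worked so far. I always get empty HTML. I also tried fetching the same URL with Python, which returns the correct HTML every time. I'm really not sure why... Please help, as I'm new to both Ruby and Python. I'd like to use Ruby (I prefer its clean syntax and human-friendly method names, and installing libraries with gem and Homebrew on a Mac is easier than Python's easy_install), but I'm now considering Python because it just works (although I'm still trying to sort out the 2.x vs 3.x issue). I may be doing something really stupid, but I think that's unlikely.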

ruby 1.9.2p136 (2010-12-25 revision 30365) [i386-darwin10.6.0]

Implementation 1:

url = URI.parse('http//:www.stackoverflow.com/') req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http|   http.request(req) }    
puts res.body #empty

Implementation 2:

doc = Nokogiri::HTML(open("http//:www.stackoverflow.com/", "User-Agent" => "Safari"))
#empty
#I tried to use without user agent, without Nokogiri none worked.

Python implementation that works perfectly every time:

f = urllib.urlopen("http//:www.stackoverflow.com/")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()

print s

Answer 1

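If this is your exact code, it is invalid for several reasons.

The scheme is malformed: http//: should be http://
A URL needs a path. If you want the root page of example.com, it has to be http://example.com/, so the trailing slash matters.
If you put two statements on one line, you need a ; to mark the end of the first one.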
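The first point can be checked without touching the network. Here is a minimal sketch, assuming a hypothetical helper named fetchable_http_url? (my own name, not part of Ruby's standard library), that reports whether a string parses as an http/https URL with a host:

```ruby
require 'uri'

# Hypothetical helper (not a stdlib method): true only when the string
# parses as an http/https URL that carries a host name.
def fetchable_http_url?(string)
  uri = URI.parse(string)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.to_s.empty?
rescue URI::InvalidURIError
  false
end

puts fetchable_http_url?('http//:www.stackoverflow.com/')  # false
puts fetchable_http_url?('http://www.stackoverflow.com/')  # true
```

Running a check like this before the request makes a typo in the scheme show up immediately instead of as an empty page later.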

So:

require 'net/http'

url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
# request_uri keeps the query string; url.path alone would drop "?clue=...".
req = Net::HTTP::Get.new(url.request_uri)
res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }
puts res.body

The same goes for using open with Nokogiri.
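When the URL carries a query string, it helps to know that URI#path excludes it while URI::HTTP#request_uri includes it, so whichever one is passed to Net::HTTP::Get.new decides whether the search parameters reach the server. A small sketch that only parses and prints, with no request made (example.com is just an illustrative host):

```ruby
require 'uri'

url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')

puts url.path         # /search/listings
puts url.query        # clue=plumber&locationClue=Australia
puts url.request_uri  # /search/listings?clue=plumber&locationClue=Australia

# With no path at all, path is empty while request_uri falls back to "/".
root = URI.parse('http://example.com')
puts root.path.inspect  # ""
puts root.request_uri   # /
```

This also illustrates the trailing-slash point: an empty path is not a valid request target, whereas request_uri always yields something that can be sent.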

Edit: the site itself returns bad results much of the time:

counter = 0

20.times do
  url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
  req = Net::HTTP::Get.new(url.request_uri) # request_uri includes the query string
  res = Net::HTTP.start(url.host, url.port) { |http| http.request(req) }
  sleep 1
  counter += 1 unless res.body.empty? # count only the non-empty responses
end

puts counter # number of non-empty responses out of 20

curl "http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia"

produces the same inconsistent results.
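If the server really does answer with an empty body some of the time, one workaround is to retry until something non-empty comes back. This is only a sketch under that assumption: first_non_empty is a hypothetical helper of my own, and the attempts and delay defaults are arbitrary. The fetch itself is passed in as a block, so the retry logic can be tried without a network connection.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: calls the block until it returns a non-empty
# string, or until the attempts run out. Returns nil if every try was empty.
def first_non_empty(attempts: 5, delay: 1)
  attempts.times do |i|
    body = yield
    return body unless body.nil? || body.empty?
    sleep delay if i < attempts - 1 # no need to wait after the last try
  end
  nil
end

# Possible usage against the site from the question (performs real requests):
#
#   url  = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
#   body = first_non_empty { Net::HTTP.get_response(url).body }
#   puts(body ? body.length : 'still empty after all attempts')
```

Net::HTTP.get_response accepts a URI object and sends the path together with the query string, so nothing is lost from the URL.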

Answer 2

Two examples using open-uri (standard library), which is, among other things, a wrapper around the rather cumbersome Net::HTTP:

require 'open-uri'

# On Ruby 2.5+ write URI.open(...); Kernel#open no longer accepts URLs as of Ruby 3.0.
open("http://www.stackoverflow.com/") { |f| puts f.read }

puts URI.parse("http://www.google.com/").read
