我在这里遇到了一个问题。 我在铁轨上使用红宝石: ruby 1.8.7(2011-12-10 patchlevel 356) rails 2.3.14
我正在尝试使用open-uri在以下地址进行简单的打开:
http://jollymag.net/n/10390 - летни-секс-пози-във-водата.html(链接 NSFW )
但是,读取时生成的文件会生成一个奇怪的(损坏的)字符串。 这也在ruby 1.9.3和rails 3.2.x上进行了测试。
require 'open-uri'
url = 'http://jollymag.net/n/10390-летни-секс-пози-във-водата.html'
url = URI.encode(url)
file = open(url)
doc = file.collect.to_s # <- the document is broken
document = Nokogiri::HTML.parse(doc,nil,"utf8")
puts document # <- the document after nokogiri has one line of content
我尝试了Iconv的东西和其他东西,但没有任何作用。对于确切的问题,上面的代码或多或少是一个最小的孤立案例。
我感谢任何帮助,因为我现在想要解决这个问题几天了。
此致 Yavor
答案 0 :(得分:2)
所以问题对我来说是个棘手的问题。 似乎有些服务器只返回gzip-ed响应。 因此,为了阅读你当然必须相应地阅读它。 我决定发布我的整个抓取代码,以便人们可以找到更完整的解决方案来解决这些问题。这是一个更大的阶级的一部分,所以它指的是很多时候自我。
希望它有所帮助!
SHINSO_HEADERS = {
'Accept' => '*/*',
'Accept-Charset' => 'utf-8, windows-1251;q=0.7, *;q=0.6',
'Accept-Encoding' => 'gzip,deflate',
'Accept-Language' => 'bg-BG, bg;q=0.8, en;q=0.7, *;q=0.6',
'Connection' => 'keep-alive',
'From' => 'support@xenium.bg',
'Referer' => 'http://svejo.net/',
'User-Agent' => 'Mozilla/5.0 (compatible; Shinso/1.0;'
}
def crawl(url_address)
self.errors = Array.new
begin
begin
url_address = URI.parse(url_address)
rescue URI::InvalidURIError
url_address = URI.decode(url_address)
url_address = URI.encode(url_address)
url_address = URI.parse(url_address)
end
url_address.normalize!
stream = ""
timeout(10) { stream = url_address.open(SHINSO_HEADERS) }
if stream.size > 0
url_crawled = URI.parse(stream.base_uri.to_s)
else
self.errors << "Server said status 200 OK but document file is zero bytes."
return
end
rescue Exception => exception
self.errors << exception
return
end
# extract information before html parsing
self.url_posted = url_address.to_s
self.url_parsed = url_crawled.to_s
self.url_host = url_crawled.host
self.status = stream.status
self.content_type = stream.content_type
self.content_encoding = stream.content_encoding
self.charset = stream.charset
if stream.content_encoding.include?('gzip')
document = Zlib::GzipReader.new(stream).read
elsif stream.content_encoding.include?('deflate')
document = Zlib::Deflate.new().deflate(stream).read
#elsif stream.content_encoding.include?('x-gzip') or
#elsif stream.content_encoding.include?('compress')
else
document = stream.read
end
self.charset_guess = CharGuess.guess(document)
if not self.charset_guess.blank? or
not self.charset_guess == 'utf-8' or
not self.charset_guess == 'utf8'
document = Iconv.iconv("UTF-8", self.charset_guess , document).to_s
end
document = Nokogiri::HTML.parse(document,nil,"utf8")
document.xpath('//script').remove
document.xpath('//SCRIPT').remove
for item in document.xpath('//*[translate(@src, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")]')
item.set_attribute('src',make_absolute_address(item['src']))
end
document = document.to_s.gsub(/<!--(.|\s)*?-->/,'')
#document = document.to_s.gsub(/\<![ \r\n\t]*(--([^\-]|[\r\n]|-[^\-])*--[ \r\n\t]*)\>/,'')
self.content = Nokogiri::HTML.parse(document,nil,"utf8")
end