Question

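I'm parsing the images of the Gantz manga, starting from http://manga.bleachexile.com/gantz-chapter-1.html and going onward.

It worked fine until my crawler tried to open an image from chapter 273:

bad URI(is not URI?): http://static.bleachexile.com/manga/gantz/273/Gantz[0273]_p001[Whatever-Illuminati].png

But I guess this URL is valid, since I can open it in Firefox. Any ideas?

Part of the code: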

img_link = nav.page.image_urls.find {|x| x.include?("manga/gantz")}
img_name = RAILS_ROOT+"/public/#{nome}/#{cap}/"+nome+((template).sub('::cap::', cap.to_s).sub('::pag::', i.to_s))
img = File.new( img_name, 'w' )
img.write( open(img_link) {|f| f.read} )
img.close

Answer 1

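That is not a valid URI. URIs only allow certain characters, and the square brackets in this path have to be percent-encoded. By the way, Firefox, like all browsers, tries to do as much as it can for the user instead of complaining when something doesn't conform to the standard.

It works in the following form: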

open("http://static.bleachexile.com/manga/gantz/273/Gantz%5B0273%5D_p001%5BWhatever-Illuminati%5D.png") # => #<File:/tmp/open-uri20100226-3342-clj08a-0>

You can try escaping it like this:

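uri.gsub(/\/.*/) do |t|
  # Percent-encode each byte of every character outside the allowed set.
  # (c[0] is an Integer only on Ruby 1.8; c.bytes works on later versions too.)
  t.gsub(/[^.\/a-zA-Z0-9\-_ ]/) do |c|
    c.bytes.map { |b| "%%%02X" % b }.join
  end.gsub(" ", "+")
end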

But be careful: if the site already uses properly escaped URIs and you escape them a second time, the URI no longer points to the same location.
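One way to reduce the double-escaping risk is to encode only characters that are not legal in a URI and to leave existing %XX sequences alone. The following is only a rough sketch under that assumption: escape_once and UNSAFE are hypothetical names, not part of open-uri, and the helper cannot tell an intentional literal "%5B" in a file name from an escape that was already applied.

```ruby
require "uri"

# Hypothetical helper, not part of open-uri or the URI library.
# Matches a "%" that does not start a %XX escape, or any character that is
# neither an unreserved nor a reserved URI character ("[" and "]" are treated
# as unsafe here because they are only legal inside an IPv6 host).
UNSAFE = /%(?![0-9A-Fa-f]{2})|[^A-Za-z0-9\-._~:\/?\#@!$&'()*+,;=%]/

def escape_once(url)
  # Replace each unsafe character with the percent-encoded form of its bytes.
  url.gsub(UNSAFE) do |c|
    c.bytes.map { |b| "%%%02X" % b }.join
  end
end

raw = "http://static.bleachexile.com/manga/gantz/273/Gantz[0273]_p001[Whatever-Illuminati].png"
escaped = escape_once(raw)

# The escaped form parses, and escaping it again changes nothing.
URI.parse(escaped)
twice = escape_once(escaped)
```

With this, the result of escape_once(img_link) could be passed to open in place of img_link, whether or not the site has already escaped the link.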

Ruby open-uri returning an error when opening a PNG URL
