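I'm trying to do some web scraping, but the WWW::Mechanize gem doesn't seem to like the page's encoding and crashes.
The POST request results in a 302 redirect (which Mechanize follows, so far so good), and it is the resulting page that seems to crash it.
I've googled a lot, but nothing I found so far solves this problem. Does anyone have an idea?
Code: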
require 'rubygems'
require 'mechanize'
agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
answer = agent.post('https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung',
{"Country" => "Deutschland",
"Abholstation" => "Aalen",
"Abgabestation" => "Aalen",
"Abholdatum" => "26.02.2009",
"Abholzeit_stunde" => "13",
"Abholzeit_minute" => "30",
"Abgabedatum" => "28.02.2009",
"Abgabezeit_stunde" => "13",
"Abgabezeit_minute" => "30",
"CountryID" => "DE",
"AbholstationID"=>"AA1",
"AbgabestationID"=>"AA1"
}
)
puts answer.body
Error:
D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `iconv': "\204nderungen vorbe"... (Iconv::IllegalSequence)
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `to_native_charset'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_header_handler.rb:29:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_body_parser.rb:35:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/pre_connect_hook.rb:14:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:25:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:494:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:545:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:403:in `post_form'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:322:in `post'
from test.rb:7
Answer 0 (score: 3)
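The page is definitely UTF-8, but Mechanize uses NKF (part of Ruby's standard library) to guess the encoding, and for some reason it comes up with Shift JIS. The quickest way to work around this is to override Mechanize's encoding map, so that when it uses Iconv to convert the body to UTF-8, it also passes UTF-8 as the source encoding. You can do that like this: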
WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"
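Put this after the line where you require the Mechanize library. You'll probably want to unset the value again straight afterwards or, even better, find the root cause of the problem and submit a patch if necessary.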
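The "unset it again afterwards" advice can be sketched as a small helper that overrides one entry of a hash for the duration of a block and then restores it. This is only an illustration: CODE_MAP is a hypothetical stand-in for WWW::Mechanize::Util::CODE_DIC, and with_temporary_entry is a name made up here, not part of Mechanize.

```ruby
# Hypothetical stand-in for a library's encoding map such as
# WWW::Mechanize::Util::CODE_DIC (the real constant is not loaded here).
CODE_MAP = { :SJIS => "Shift_JIS", :UTF8 => "UTF-8" }

# Override map[key] while the block runs, then put the old state back.
# The ensure clause runs even if the block raises.
def with_temporary_entry(map, key, value)
  had_key  = map.key?(key)
  original = map[key]
  map[key] = value
  yield
ensure
  if had_key
    map[key] = original
  else
    map.delete(key)
  end
end

# Inside the block the override is visible; afterwards it is gone.
seen = with_temporary_entry(CODE_MAP, :SJIS, "UTF-8") { CODE_MAP[:SJIS] }
```

With the real library, the block would wrap the agent.post call, so the override only affects that one request.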
Note: I worked this out by debugging the Mechanize library using the backtrace. The to_native_charset method calls detect_charset, which is where the problem lies.
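As a rough cross-check of the "page is UTF-8" diagnosis, the byte that Iconv complains about can be compared with the UTF-8 encoding of "Ä". The sketch below assumes the offending word on the page is "Änderungen", which the error message suggests but does not prove. It uses only core String methods, so it also runs on Ruby versions that no longer ship Iconv.

```ruby
# "\204" in the error message is an octal escape for the byte 0x84.
reported_byte = "\204".b.bytes.first

# UTF-8 encodes "Ä" (U+00C4) as the two bytes 0xC3 0x84.
utf8_prefix = "\u00C4nderungen".bytes.first(2)

# The reported byte matches the second byte of that sequence, which is
# consistent with UTF-8 text being decoded under the wrong source encoding.
matches_second_byte = (reported_byte == utf8_prefix[1])

# On its own, the fragment from the error message is not valid UTF-8,
# because 0x84 is a continuation byte with no lead byte in front of it.
fragment_is_valid = "\204nderungen".b.force_encoding("UTF-8").valid_encoding?

puts format("reported byte: 0x%02X", reported_byte)
puts "UTF-8 prefix:  " + utf8_prefix.map { |b| format("0x%02X", b) }.join(" ")
```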
Answer 1 (score: 0)
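In my case, the get method returned a Mechanize::File, which does no encoding handling at all.
I was able to fix it by converting manually with Iconv, but that only works if you know the source encoding.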
result = @agent.get uri
# Mechanize::File instead of Mechanize::Page is returned
# so we have to convert manually
result = Iconv.conv("utf-8", "iso-8859-1", result.body)
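Iconv was removed from Ruby's standard library in 2.0, so on a newer Ruby the same manual conversion can be sketched with String#force_encoding and String#encode instead. This swaps in a different technique from the Iconv call above. The helper name latin1_to_utf8 and the sample bytes are made up for illustration, and the source encoding still has to be known in advance.

```ruby
# Reinterpret raw bytes as ISO-8859-1, then transcode them to UTF-8.
# dup keeps the caller's string untouched, since force_encoding mutates.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Sample body: in ISO-8859-1 the single byte 0xC4 is "Ä".
raw  = "\xC4nderungen vorbehalten".b
text = latin1_to_utf8(raw)

puts text
```

In the scraping code, raw would be result.body from the Mechanize::File.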