Iconv::IllegalSequence when using WWW::Mechanize

Asked: 2009-02-25 14:22:24

Tags: ruby screen-scraping iconv mechanize-ruby

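I'm trying to do some web scraping, but the WWW::Mechanize gem doesn't seem to like the encoding and crashes.
The POST request results in a 302 redirect (which Mechanize follows, so far so good), and the resulting page is what seems to make it crash. I have googled a lot, but nothing I found so far solves this problem. Does anyone have an idea?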

Code:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

agent.user_agent_alias = 'Mac Safari'
answer = agent.post('https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung',
{"Country" => "Deutschland",
"Abholstation" => "Aalen",
"Abgabestation" => "Aalen",
"Abholdatum" => "26.02.2009",
"Abholzeit_stunde" => "13",
"Abholzeit_minute" => "30",
"Abgabedatum" => "28.02.2009",
"Abgabezeit_stunde" => "13",
"Abgabezeit_minute" => "30",
"CountryID" => "DE",
"AbholstationID"=>"AA1",
"AbgabestationID"=>"AA1"
}
)
puts answer.body

Error:

D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `iconv': "\204nderungen vorbe"... (Iconv::IllegalSequence)
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `to_native_charset'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_header_handler.rb:29:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_body_parser.rb:35:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/pre_connect_hook.rb:14:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:25:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:494:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:545:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:403:in `post_form'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:322:in `post'
from test.rb:7

2 answers:

Answer 0 (score: 3)

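The page is definitely UTF-8, but Mechanize uses NKF (a core Ruby library) to guess the encoding, and for some reason it comes up with Shift JIS. The quickest way to fix this is to override Mechanize's encoding map, so that when it uses Iconv to convert the body to UTF-8, it also passes UTF-8 as the source encoding. You can do that like this: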

WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"

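Put that after the line where you require the Mechanize library. You will probably want to treat this override as temporary; better still, find the root cause of the problem and submit a patch if necessary.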
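If the override is only meant to be temporary, one option is to wrap it so that the old value comes back afterwards. This is a minimal sketch, assuming the map behaves like a plain mutable Hash (which the assignment above implies). `with_charset_override` is a hypothetical helper, not part of Mechanize, and the hash below is only a stand-in for WWW::Mechanize::Util::CODE_DIC so that the snippet runs on its own:

```ruby
# Hypothetical helper (not part of Mechanize): override one entry of an
# encoding map for the duration of a block, then restore the old state,
# even if the block raises.
def with_charset_override(map, key, value)
  had_key  = map.key?(key)
  previous = map[key]
  map[key] = value
  yield
ensure
  if had_key
    map[key] = previous
  else
    map.delete(key)
  end
end

# Stand-in for WWW::Mechanize::Util::CODE_DIC; its contents are assumed here.
code_dic = { :SJIS => "SHIFT_JIS", :UTF8 => "UTF-8" }

during = with_charset_override(code_dic, :SJIS, "UTF-8") { code_dic[:SJIS] }
puts "inside the block: #{during}"
puts "after the block:  #{code_dic[:SJIS]}"
```

With the real library you would pass WWW::Mechanize::Util::CODE_DIC as the first argument and make the agent.post call inside the block.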

Note: I worked this out by using the backtrace to debug the Mechanize library. The to_native_charset method calls detect_charset, which is where the problem lies.
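The error message fits that diagnosis: \204 is octal for the byte 0x84, which is the second byte of "Ä" in UTF-8 (0xC3 0x84), so the text is most likely "Änderungen vorbehalten". The sketch below reproduces the same kind of failure without Mechanize or Iconv, using Ruby 1.9+ String#encode instead. The helper name `to_utf8` and the sample text are assumptions made for illustration:

```ruby
# Hypothetical helper: treat raw bytes as `source_encoding` and convert them
# to UTF-8. Returns [:ok, text] on success or [:error, exception_class_name].
def to_utf8(bytes, source_encoding)
  text = bytes.dup.force_encoding(source_encoding)
  raise ArgumentError, "not valid #{source_encoding}" unless text.valid_encoding?
  [:ok, text.encode("UTF-8")]
rescue EncodingError, ArgumentError => e
  [:error, e.class.name]
end

# Assumed sample: "Änderungen vorbehalten" as raw UTF-8 bytes.
sample = "\xC3\x84nderungen vorbehalten".b

# Correct guess: the bytes are accepted as UTF-8.
p to_utf8(sample, "UTF-8")

# Wrong guess: 0x84 followed by "n" has no Shift JIS mapping, so the
# conversion fails, much like Iconv::IllegalSequence in the question.
p to_utf8(sample, "Shift_JIS")
```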

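Answer 1 (score: 0)

In my case, the get method returned a Mechanize::File, which does not handle encoding at all.
I was able to fix it by converting manually with Iconv, but this only works if you know the encoding.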

require 'iconv'

page = @agent.get uri
# Mechanize::File is returned instead of Mechanize::Page,
# so the body has to be converted manually
result = Iconv.conv("utf-8", "iso-8859-1", page.body)
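Iconv was removed from the Ruby standard library in 2.0, so on newer Rubies the same manual conversion can be done with String#encode. This is a rough equivalent rather than a drop-in replacement. `recode_body` is a hypothetical helper name, and the sample bytes stand in for a real response body:

```ruby
# Hypothetical helper: interpret the raw body as `from` and return a UTF-8
# copy. As with Iconv.conv, the source encoding has to be known in advance.
def recode_body(body, from)
  body.dup.force_encoding(from).encode("UTF-8")
end

# Assumed sample body: "Änderungen" encoded as ISO-8859-1 (Ä is byte 0xC4).
raw = "\xC4nderungen".b

text = recode_body(raw, "ISO-8859-1")
puts text
puts text.encoding
```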