Ruby Mechanize stops with an encoding error: input conversion failed due to input error, bytes 0xFA 0xBA 0x3C 0x2F

Asked: 2017-12-24 09:47:17

Tags: ruby nokogiri mechanize libxml2

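I am using Ruby Mechanize (Ruby version: 2.3.1p112, Mechanize version: 2.7.5) to scrape web pages. On some pages it stops with the error "input conversion failed due to input error". The page I am trying to scrape is http://www.jra.go.jp/JRADB/accessD.html. My code and the execution log are below.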

require 'mechanize'
require 'logger'

agent = Mechanize.new
agent.log = Logger.new(STDOUT)
agent.post('http://www.jra.go.jp/JRADB/accessD.html', { "cname" => "pw01bmd0006201705080220171224/7A" })

D, [2017-12-24T18:34:32.039095 #37600] DEBUG -- : query: "cname=pw01bmd0006201705080220171224%2F7A"
I, [2017-12-24T18:34:32.039562 #37600]  INFO -- : Net::HTTP::Post: /JRADB/accessD.html
D, [2017-12-24T18:34:32.039657 #37600] DEBUG -- : request-header: accept-encoding => gzip,deflate,identity
D, [2017-12-24T18:34:32.039742 #37600] DEBUG -- : request-header: accept => */*
D, [2017-12-24T18:34:32.039873 #37600] DEBUG -- : request-header: user-agent => Mechanize/2.7.5 Ruby/2.3.1p112 (http://github.com/sparklemotion/mechanize/)
D, [2017-12-24T18:34:32.039996 #37600] DEBUG -- : request-header: accept-charset => ISO-8859-1,utf-8;q=0.7,*;q=0.7
D, [2017-12-24T18:34:32.040105 #37600] DEBUG -- : request-header: accept-language => en-us,en;q=0.5
D, [2017-12-24T18:34:32.040209 #37600] DEBUG -- : request-header: host => www.jra.go.jp
D, [2017-12-24T18:34:32.040323 #37600] DEBUG -- : request-header: referer => http://www.jra.go.jp/JRADB/accessD.html
D, [2017-12-24T18:34:32.040428 #37600] DEBUG -- : request-header: content-type => application/x-www-form-urlencoded
D, [2017-12-24T18:34:32.040547 #37600] DEBUG -- : request-header: content-length => 40
I, [2017-12-24T18:34:32.093773 #37600]  INFO -- : status: Net::HTTPOK 1.1 200 OK
D, [2017-12-24T18:34:32.094155 #37600] DEBUG -- : response-header: server => Apache
D, [2017-12-24T18:34:32.094296 #37600] DEBUG -- : response-header: x-frame-options => SAMEORIGIN
D, [2017-12-24T18:34:32.094736 #37600] DEBUG -- : response-header: content-encoding => gzip
D, [2017-12-24T18:34:32.094953 #37600] DEBUG -- : response-header: content-length => 13989
D, [2017-12-24T18:34:32.095272 #37600] DEBUG -- : response-header: content-type => text/html
D, [2017-12-24T18:34:32.095590 #37600] DEBUG -- : response-header: date => Sun, 24 Dec 2017 09:34:32 GMT
D, [2017-12-24T18:34:32.095980 #37600] DEBUG -- : response-header: connection => keep-alive
D, [2017-12-24T18:34:32.096387 #37600] DEBUG -- : response-header: vary => Accept-Encoding
D, [2017-12-24T18:34:32.096894 #37600] DEBUG -- : Read 6790 bytes (6790 total)
D, [2017-12-24T18:34:32.105888 #37600] DEBUG -- : Read 7199 bytes (13989 total)
D, [2017-12-24T18:34:32.106710 #37600] DEBUG -- : gzip response
encoding error : input conversion failed due to input error, bytes 0xFA 0xBA 0x3C 0x2F

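I understand that the cause is a character encoding problem, and I would like to ignore the parse error when it occurs. How can I do that?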
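One way to tolerate such bytes, sketched here under assumptions rather than as a confirmed fix, is to transcode the raw response body to UTF-8 yourself, replacing anything that cannot be converted, and only then hand the string to the HTML parser. The bytes 0xFA 0xBA fall in the range that Windows-31J (CP932) adds on top of plain Shift_JIS, so the sketch assumes the page is Windows-31J encoded. The log above does not state a charset, so treat that as a guess. The helper name `to_utf8_lenient` is hypothetical, and only Ruby's standard `String#encode` is used.

```ruby
# Hypothetical helper (not part of Mechanize or Nokogiri).
# Converts raw bytes to UTF-8; byte sequences that are invalid in, or have no
# mapping from, the assumed source encoding become "?" instead of raising.
def to_utf8_lenient(raw, source_encoding = "Windows-31J")
  raw.dup.force_encoding(source_encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Example input: the four bytes from the error message, as a binary string.
raw  = "\xFA\xBA</".b
html = to_utf8_lenient(raw)
puts html.encoding         # UTF-8
puts html.valid_encoding?  # true
```

With Mechanize, the raw bytes should be available from `page.body`, and the converted string could then be passed to `Nokogiri::HTML(html, nil, "UTF-8")`. Whether this silences the libxml2 message in this exact setup is untested here.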

0 answers:

No answers yet.