How do I use Curb? Getting an unsupported protocol error

Time: 2014-02-10 14:30:49

Tags: ruby curl

I am having trouble performing a simple HTTP GET request with curb.

The code is:

def getHtml ()
  raw = Curl::Easy.perform(@url)
  puts raw.body_str
end

The error message I get when I try to run it is:

Curl::Err::UnsupportedProtocolError (Curl::Err::UnsupportedProtocolError)
   from /home/<Username>/.gem/ruby/2.1.0/gems/curb-0.8.5/lib/curl/easy.rb:317:in `perform'
   from getCorpusData.rb:6:in `getHtml'
   from getCorpusData.rb:11:in `<main>'

The request being used is:

'http://corpus2.byu.edu/glowbe/x2.asp?chooser=seq&p=%5B%3Dbat%5D&w2=&wl=4&wr=4&r1=&r2=&ipos1=-select-&B7=SEARCH&showsec=y&sec1=0&sec2=0&sortBy=freq&sortByDo2=freq&minfreq1=freq&freq1=20&freq2=20&numhits=100&kh=100&groupBy=words&whatshow=raw&saveList=no&changed=&corpus=glowbe&word=&sbs=&sbs1=&sbsreg1=&sbsr=&sbsgroup=&redidID=&ownsearch=y&compared=&holder=&whatdo=seq&rand1=y&whatdo1=1&didRandom=n&minFreq=freq&s1=0&s2=0&s3=0&perc=mi' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Accept-Language: en-US,en;q=0.8,es;q=0.6' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://corpus2.byu.edu/glowbe/x1.asp?a=&user=&word=&k=&h=&q1=&q=&c=glowbe' -H 'Cookie: ASPSESSIONIDSQCCCACB=KKPJLDIDNPDBDHCBLFDKBKLE; ASPSESSIONIDSQDAADDB=KCGNJBIAONDCMKCLGNNHEEFM; __utma=93336079.428180068.1390938982.1390938982.1391007383.2; __utmb=93336079.1.10.1391007383; __utmc=93336079; __utmz=93336079.1391007383.2.2.utmcsr=corpus.byu.edu|utmccn=(referral)|utmcmd=referral|utmcct=/; ii=4' -H 'Connection: keep-alive'

It works for examples such as www.google.co.uk, but with the request above the verbose output is:
* Protocol 'http not supported or disabled in libcurl
* Closing connection -1
/home/<username>/.gem/ruby/2.1.0/gems/curb-0.8.5/lib/curl/easy.rb:62:in `perform':    Curl::Err::UnsupportedProtocolError (Curl::Err::UnsupportedProtocolError)
from /home/<username>/.gem/ruby/2.1.0/gems/curb-0.8.5/lib/curl/easy.rb:317:in `perform'
from getCorpusData.rb:7:in `getHtml'
from getCorpusData.rb:16:in `<main>'

My current method is:

def getHtml ()
  corpus = Curl::Easy.perform(@url) do |curl|
     curl.headers["User-Agent"] = "GibSim-0.0"
     curl.verbose = true
  end
  corpus.perform
  puts corpus.body_str + "<_____HERE"
end

and the current URL is:

 'http://corpus2.byu.edu/glowbe/x2.asp?chooser=seq&p=%5B%3Dbat%5D&w2=&wl=4&wr=4&r1=&r2=&ipos1=-select-&B7=SEARCH&showsec=y&sec1=0&sec2=0&sortBy=freq&sortByDo2=freq&minfreq1=freq&freq1=20&freq2=20&numhits=100&kh=100&groupBy=words&whatshow=raw&saveList=no&changed=&corpus=glowbe&word=&sbs=&sbs1=&sbsreg1=&sbsr=&sbsgroup=&redidID=&ownsearch=y&compared=&holder=&whatdo=seq&rand1=y&whatdo1=1&didRandom=n&minFreq=freq&s1=0&s2=0&s3=0&perc=mi' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Accept-Language: en-US,en;q=0.8,es;q=0.6'

I'm still not sure what to do!

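2 answers:

Answer 0: (score: 2)

From the look of the documentation, you should only pass in the URL of the site you want to fetch, without any command-line flags. If you want to send different headers, you can pass a block like this: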

Curl::Easy.perform("http://www.google.co.uk") do |curl| 
  curl.headers["User-Agent"] = "myapp-0.0"
  curl.verbose = true
end
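
The verbose output in the question reports the protocol as 'http, with a leading quote, which suggests the whole copied string (quotes and -H flags included) was handed to curb as the URL. A minimal sketch of separating the two, assuming the string was copied from a curl command line: split_curl_args is a hypothetical helper (not part of curb), and the URL below is shortened for illustration.

```ruby
require 'shellwords'

# Hypothetical helper: split a copied curl argument string into the bare URL
# and a hash of headers taken from the -H flags.
def split_curl_args(copied)
  words = Shellwords.split(copied)        # honours the single quotes
  words.shift if words.first == 'curl'    # tolerate a leading "curl"
  url = nil
  headers = {}
  i = 0
  while i < words.length
    if words[i] == '-H' && i + 1 < words.length
      # "Name: value" -> ["Name", " value"]; split only on the first colon
      name, value = words[i + 1].split(':', 2)
      headers[name.strip] = value.to_s.strip
      i += 2
    else
      url ||= words[i]                    # first non-flag word is the URL
      i += 1
    end
  end
  [url, headers]
end

# Shortened version of the string from the question (illustration only).
copied = "'http://corpus2.byu.edu/glowbe/x2.asp?chooser=seq&p=%5B%3Dbat%5D' " \
         "-H 'Accept-Language: en-US,en;q=0.8,es;q=0.6' -H 'Connection: keep-alive'"

url, headers = split_curl_args(copied)
puts url
headers.each { |name, value| puts "#{name}: #{value}" }
```

The resulting url can then be passed to Curl::Easy.perform, and each pair in headers assigned through curl.headers[name] = value inside the block, as in the example above.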

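Answer 1: (score: 0)

The real problem was that I was missing headers.

To fix this, I needed to include those headers, which I found in the Network tab of the developer tools. In Chrome, you can open it by pressing CTRL + SHIFT + I and clicking the tab labeled "Network", then sending the request you want and clicking on it to see the details.

The headers I added are in the code below: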

require 'curb'

def getHtml(the_url)
  corpus = Curl::Easy.new(the_url) do |curl|
    curl.headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
    curl.headers["Accept-Encoding"] = "gzip,deflate,sdch"
    curl.headers["Accept-Language"] = "en-US,en;q=0.8,es;q=0.6"
    curl.enable_cookies = true
    curl.follow_location = true
    curl.http_auth_types = :basic
    curl.username = "omitted"
    curl.password = "omitted"
    curl.headers["Connection"] = "keep-alive"
    curl.headers["Cookie"] = "ASPSESSIONIDSQCCCACB=KKPJLDIDNPDBDHCBLFDKBKLE; ASPSESSIONIDSQDAADDB=KCGNJBIAONDCMKCLGNNHEEFM; ASPSESSIONIDQQCDCACA=NNJHINEBBGIAJCLLICPGGMEK; ASPSESSIONIDQSBACBCB=CCOPGJBCOLBJMFJHHBIOJEHM; ASPSESSIONIDSSBBBDCA=FJBJFFOCFCPAGJJDMINGFHNE; ASPSESSIONIDSSDDDACA=CPFPDBLDIOMAALINKEEAFHKA; ASPSESSIONIDQQBAADCA=PIKHCNHACCFNDLPBGFLAEIGH; ASPSESSIONIDSQADDBDB=LJPPAJEBADDMFNAKELPPACOL; ASPSESSIONIDSQCBADDB=BHCHBFBCOAPNIANIPEILJDMK; ASPSESSIONIDQQACBADB=BPGLCBOCABKMOJEHGMGINDFD; ASPSESSIONIDQSCBAACA=FCMHMMKDKPJNKAONBHDLAMEJ; password=; email=; __utma=93336079.428180068.1390938982.1391083358.1392043146.4; __utmc=93336079; __utmz=93336079.1391007383.2.2.utmcsr=corpus.byu.edu|utmccn=(referral)|utmcmd=referral|utmcct=/; ii=24"
    curl.headers["Host"] = "corpus2.byu.edu"
    curl.headers["Referer"] = "http://corpus2.byu.edu/glowbe/x1.asp?a=&user=&word=&k=&h=&q1=&q=&c=glowbe"
    curl.headers["User-Agent"] = "GibSim-0.0"
    curl.verbose = true
  end

  corpus.perform
  return corpus.body_str
end

url = "http://corpus2.byu.edu/glowbe/x2.asp?chooser=seq&p=%5Bsolid%5D&w2=&wl=4&wr=4&r1=&r2=&ipos1=-select-&B7=SEARCH&showsec=y&sec1=0&sec2=0&sortBy=freq&sortByDo2=freq&minfreq1=freq&freq1=20&freq2=20&numhits=100&kh=100&groupBy=words&whatshow=raw&saveList=no&changed=&corpus=glowbe&word=&sbs=&sbs1=&sbsreg1=&sbsr=&sbsgroup=&redidID=&ownsearch=y&compared=&holder=&whatdo=seq&rand1=y&whatdo1=1&didRandom=n&minFreq=freq&s1=0&s2=0&s3=0&perc=mi"

puts getHtml(url)
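
One thing that may be worth checking with the code above: it sets Accept-Encoding by hand, and as far as I know a hand-written header does not by itself make libcurl decompress the response, so body_str could come back as raw gzip bytes if the server compresses. A minimal sketch of a fallback using only Ruby's standard library; maybe_gunzip is a hypothetical helper, not part of curb, and it only handles gzip (not deflate or sdch).

```ruby
require 'zlib'
require 'stringio'

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = "\x1f\x8b".b

# Hypothetical helper: return the body unchanged unless it looks like gzip,
# in which case return the decompressed text.
def maybe_gunzip(body)
  bytes = body.b                          # binary copy for the byte comparison
  return body unless bytes.start_with?(GZIP_MAGIC)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end
```

It could be applied as maybe_gunzip(corpus.body_str) before returning from getHtml. Dropping the Accept-Encoding header altogether is a simpler alternative if compression is not needed.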

I hope this helps someone else.