I'm having trouble performing a simple HTTP GET request with curb.
The code is:
def getHtml()
  raw = Curl::Easy.perform(@url)
  puts raw.body_str
end
The error message I get when I try to run it is:
Curl::Err::UnsupportedProtocolError (Curl::Err::UnsupportedProtocolError)
from /home/<Username>/.gem/ruby/2.1.0/gems/curb-0.8.5/lib/curl/easy.rb:317:in `perform'
from getCorpusData.rb:6:in `getHtml'
from getCorpusData.rb:11:in `<main>'
The request I'm using is:
'http://corpus2.byu.edu/glowbe/x2.asp?chooser=seq&p=%5B%3Dbat%5D&w2=&wl=4&wr=4&r1=&r2=&ipos1=-select-&B7=SEARCH&showsec=y&sec1=0&sec2=0&sortBy=freq&sortByDo2=freq&minfreq1=freq&freq1=20&freq2=20&numhits=100&kh=100&groupBy=words&whatshow=raw&saveList=no&changed=&corpus=glowbe&word=&sbs=&sbs1=&sbsreg1=&sbsr=&sbsgroup=&redidID=&ownsearch=y&compared=&holder=&whatdo=seq&rand1=y&whatdo1=1&didRandom=n&minFreq=freq&s1=0&s2=0&s3=0&perc=mi' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Accept-Language: en-US,en;q=0.8,es;q=0.6' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://corpus2.byu.edu/glowbe/x1.asp?a=&user=&word=&k=&h=&q1=&q=&c=glowbe' -H 'Cookie: ASPSESSIONIDSQCCCACB=KKPJLDIDNPDBDHCBLFDKBKLE; ASPSESSIONIDSQDAADDB=KCGNJBIAONDCMKCLGNNHEEFM; __utma=93336079.428180068.1390938982.1390938982.1391007383.2; __utmb=93336079.1.10.1391007383; __utmc=93336079; __utmz=93336079.1391007383.2.2.utmcsr=corpus.byu.edu|utmccn=(referral)|utmcmd=referral|utmcct=/; ii=4' -H 'Connection: keep-alive'
It works for examples such as www.google.co.uk. For the request above I get:
* Protocol 'http not supported or disabled in libcurl
* Closing connection -1
/home/<username>/.gem/ruby/2.1.0/gems/curb-0.8.5/lib/curl/easy.rb:62:in `perform': Curl::Err::UnsupportedProtocolError (Curl::Err::UnsupportedProtocolError)
from /home/<username>/.gem/ruby/2.1.0/gems/curb-0.8.5/lib/curl/easy.rb:317:in `perform'
from getCorpusData.rb:7:in `getHtml'
from getCorpusData.rb:16:in `<main>'
My current method is:
def getHtml()
  corpus = Curl::Easy.perform(@url) do |curl|
    curl.headers["User-Agent"] = "GibSim-0.0"
    curl.verbose = true
  end
  corpus.perform
  puts corpus.body_str + "<_____HERE"
end
and the current URL is:
'http://corpus2.byu.edu/glowbe/x2.asp?chooser=seq&p=%5B%3Dbat%5D&w2=&wl=4&wr=4&r1=&r2=&ipos1=-select-&B7=SEARCH&showsec=y&sec1=0&sec2=0&sortBy=freq&sortByDo2=freq&minfreq1=freq&freq1=20&freq2=20&numhits=100&kh=100&groupBy=words&whatshow=raw&saveList=no&changed=&corpus=glowbe&word=&sbs=&sbs1=&sbsreg1=&sbsr=&sbsgroup=&redidID=&ownsearch=y&compared=&holder=&whatdo=seq&rand1=y&whatdo1=1&didRandom=n&minFreq=freq&s1=0&s2=0&s3=0&perc=mi' -H 'Accept-Encoding: gzip,deflate,sdch' -H 'Accept-Language: en-US,en;q=0.8,es;q=0.6'
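I'm still not sure what to do!
Answer 0 (score: 2)
Judging by the documentation, you should pass in only the URL of the site you want to fetch. If you want to send different headers, you can pass a block like this: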
Curl::Easy.perform("http://www.google.co.uk") do |curl|
  curl.headers["User-Agent"] = "myapp-0.0"
  curl.verbose = true
end
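The verbose output in the question reads `Protocol 'http`, with a leading quote, which suggests (this is an inference, not something the thread confirms) that the quotes and `-H` flags from a copied cURL command ended up inside the string handed to curb. A minimal sketch of separating the two, using only Ruby's standard `shellwords` library; the helper name `split_curl_args` and the shortened sample string are hypothetical:

```ruby
require 'shellwords'

# Hypothetical helper: split a pasted cURL argument string into the bare URL
# and a hash of headers. Only the -H flag is handled; any other non-flag
# token after the first is ignored.
def split_curl_args(pasted)
  tokens = Shellwords.split(pasted) # strips the shell-style single quotes
  url = nil
  headers = {}
  until tokens.empty?
    token = tokens.shift
    if token == '-H'
      # "Name: value" -> ["Name", " value"]; split only on the first colon
      name, value = tokens.shift.to_s.split(':', 2)
      next if name.nil?
      headers[name.strip] = value.to_s.strip
    else
      url ||= token # the first non-flag token is taken as the URL
    end
  end
  [url, headers]
end

# Shortened, made-up sample in the same shape as the string in the question
pasted = "'http://www.google.co.uk/' -H 'Accept-Language: en-US,en;q=0.8' " \
         "-H 'Referer: http://www.google.co.uk/search'"
url, headers = split_curl_args(pasted)
```

The resulting `url` can then be passed on its own to `Curl::Easy.perform`, with each pair in `headers` assigned to `curl.headers[...]` inside the block, as in the snippet above.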
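Answer 1 (score: 0)
The real problem was that I was missing headers.
To fix this, I needed to include those headers, which I found through the Network tab of the developer tools. In Chrome, you can open it by pressing CTRL + SHIFT + I and clicking the tab labelled "Network", then send the desired request and click on it to see the details.
The headers I added are in the code below: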
require 'curb'

def getHtml(the_url)
  corpus = Curl::Easy.new(the_url) do |curl|
    # Headers copied from the request shown in the browser's Network tab
    curl.headers["Accept"] = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
    curl.headers["Accept-Encoding"] = "gzip,deflate,sdch"
    curl.headers["Accept-Language"] = "en-US,en;q=0.8,es;q=0.6"
    curl.enable_cookies = true
    curl.follow_location = true
    curl.http_auth_types = :basic
    curl.username = "omitted"
    curl.password = "omitted"
    curl.headers["Connection"] = "keep-alive"
    curl.headers["Cookie"] = "ASPSESSIONIDSQCCCACB=KKPJLDIDNPDBDHCBLFDKBKLE; ASPSESSIONIDSQDAADDB=KCGNJBIAONDCMKCLGNNHEEFM; ASPSESSIONIDQQCDCACA=NNJHINEBBGIAJCLLICPGGMEK; ASPSESSIONIDQSBACBCB=CCOPGJBCOLBJMFJHHBIOJEHM; ASPSESSIONIDSSBBBDCA=FJBJFFOCFCPAGJJDMINGFHNE; ASPSESSIONIDSSDDDACA=CPFPDBLDIOMAALINKEEAFHKA; ASPSESSIONIDQQBAADCA=PIKHCNHACCFNDLPBGFLAEIGH; ASPSESSIONIDSQADDBDB=LJPPAJEBADDMFNAKELPPACOL; ASPSESSIONIDSQCBADDB=BHCHBFBCOAPNIANIPEILJDMK; ASPSESSIONIDQQACBADB=BPGLCBOCABKMOJEHGMGINDFD; ASPSESSIONIDQSCBAACA=FCMHMMKDKPJNKAONBHDLAMEJ; password=; email=; __utma=93336079.428180068.1390938982.1391083358.1392043146.4; __utmc=93336079; __utmz=93336079.1391007383.2.2.utmcsr=corpus.byu.edu|utmccn=(referral)|utmcmd=referral|utmcct=/; ii=24"
    curl.headers["Host"] = "corpus2.byu.edu"
    curl.headers["Referer"] = "http://corpus2.byu.edu/glowbe/x1.asp?a=&user=&word=&k=&h=&q1=&q=&c=glowbe"
    curl.headers["User-Agent"] = "GibSim-0.0"
    curl.verbose = true
  end
  # Send the request and return the response body
  corpus.perform
  return corpus.body_str
end
url = "http://corpus2.byu.edu/glowbe/x2.asp?chooser=seq&p=%5Bsolid%5D&w2=&wl=4&wr=4&r1=&r2=&ipos1=-select-&B7=SEARCH&showsec=y&sec1=0&sec2=0&sortBy=freq&sortByDo2=freq&minfreq1=freq&freq1=20&freq2=20&numhits=100&kh=100&groupBy=words&whatshow=raw&saveList=no&changed=&corpus=glowbe&word=&sbs=&sbs1=&sbsreg1=&sbsr=&sbsgroup=&redidID=&ownsearch=y&compared=&holder=&whatdo=seq&rand1=y&whatdo1=1&didRandom=n&minFreq=freq&s1=0&s2=0&s3=0&perc=mi"
puts getHtml(url)
I hope this helps someone else.
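As a possible tidy-up of the approach in this answer, the headers could be kept in a hash and the Cookie value built from name/value pairs instead of one long literal. This is only a sketch: `cookie_header`, `apply_headers` and `FakeRequest` are hypothetical names, and `FakeRequest` is a stand-in so the example runs without curb installed. It assumes only that the request object exposes a `headers` hash, as the `curl.headers[...]` assignments above do.

```ruby
# Hypothetical helper: join cookie name/value pairs into one Cookie header value.
def cookie_header(cookies)
  cookies.map { |name, value| "#{name}=#{value}" }.join('; ')
end

# Hypothetical helper: copy each header onto any object exposing a `headers` hash.
def apply_headers(request, headers)
  headers.each { |name, value| request.headers[name] = value }
  request
end

# Stand-in for a request object (not part of curb), used only for this sketch.
FakeRequest = Struct.new(:headers)

headers = {
  "Accept-Language" => "en-US,en;q=0.8,es;q=0.6",
  "Referer" => "http://corpus2.byu.edu/glowbe/x1.asp?a=&user=&word=&k=&h=&q1=&q=&c=glowbe",
  "User-Agent" => "GibSim-0.0",
  # Two of the cookies from the answer above, chosen for brevity
  "Cookie" => cookie_header({ "__utmc" => "93336079", "ii" => "24" })
}

request = apply_headers(FakeRequest.new({}), headers)
```

With curb, the same call would go inside the block, for example `apply_headers(curl, headers)` in place of the individual `curl.headers[...]` lines.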