Search engines, in particular search.lycos.co.uk. I can query it from a script, but I cannot extract the individual results from the page source. Any help is greatly appreciated.

EDIT:
host = 'http://search.lycos.co.uk/?query=%s&page2=%s' % (str(query), repr(page))
req = urllib2.Request(host)
req.add_header('User-Agent', User_Agent)
response = urllib2.urlopen(req)
source = response.read()
I don't know how to get each individual result out of that source.
Answer 0 (score: 0)
I tried this:
query='testing!'
page=1
host = 'http://search.lycos.co.uk/?query=%s&page2=%s' % (str(query), repr(page))
print urllib2.urlopen(host).read()
Try that and see whether it works. It works here.
I also built the request with urllib2.Request, and that works here too:
import urllib
import urllib2
data = {'query': 'testing', 'page2': '1'}
req = urllib2.Request(host, data=urllib.urlencode(data))
req.add_header('User-Agent', <yours>)
print urllib2.urlopen(req).read()
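Note that passing `data=` to `urllib2.Request` turns the request into a POST; if the site expects the parameters in the query string, you can build a GET URL with `urlencode` instead. A minimal sketch (written in Python 3 syntax, where the helper lives in `urllib.parse`; in Python 2 it is `urllib.urlencode`):

```python
from urllib.parse import urlencode

# Build the query string safely instead of interpolating raw values
params = {'query': 'testing', 'page2': '1'}
query_string = urlencode(params)
url = 'http://search.lycos.co.uk/?' + query_string
print(url)  # http://search.lycos.co.uk/?query=testing&page2=1
```

`urlencode` also percent-encodes characters like `!` or spaces, which the raw `%s` interpolation above does not.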
As a follow-up, if you want to scrape the data out of the response, these are good modules:
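BeautifulSoup (used in the next answer) is one commonly recommended option. As a dependency-free illustration, the standard library's `html.parser` can already pull result links out of fetched HTML. A minimal Python 3 sketch, assuming the results appear as ordinary `<a href>` tags:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# Hypothetical HTML standing in for a fetched results page
html = ('<ul><li><a href="http://example.com/1">one</a></li>'
        '<li><a href="http://example.com/2">two</a></li></ul>')
parser = LinkCollector()
parser.feed(html)
print(parser.links)  # ['http://example.com/1', 'http://example.com/2']
```

In practice you would feed `response.read()` to the parser and then filter `links` down to the result URLs you care about.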
Answer 1 (score: 0)
Lycos encrypts their search results. However, you can try Google instead.
import urllib
import urllib2
import re
from bs4 import BeautifulSoup

def scrapping_google(query):
    # quote_plus makes the query safe to embed in the URL
    g_url = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(g_url, None, {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0'})
    open_url = urllib2.urlopen(request)
    read_url = open_url.read()
    g_soup = BeautifulSoup(read_url, 'html.parser')
    remove_tag = re.compile(r'<.*?>')
    g_dict = {}

    # Approximate hit count shown by Google ("About N results")
    scrap_count = g_soup.find('div', attrs={'id': 'resultStats'})
    count = remove_tag.sub('', str(scrap_count)).replace('.', '')
    only_count = count[0:-16]
    print 'Prediction result: ', only_count
    print '\n'

    # Each organic result is an <li class="g"> holding a link and a snippet
    for li in g_soup.findAll('li', attrs={'class': 'g'}):
        links = li.find('a')
        print links['href']
        scrap_content = li.find('span', attrs={'class': 'st'})
        content = remove_tag.sub('', str(scrap_content)).replace('.', '')
        print content
    return g_dict

if __name__ == '__main__':
    fetch_links = scrapping_google('jokowi')
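The URL above relies on `urllib.quote_plus` to make the query safe to embed; in Python 3 the same helper lives in `urllib.parse`. A small sketch of what it does:

```python
from urllib.parse import quote_plus

# quote_plus replaces spaces with '+' and percent-encodes reserved characters
query = 'jokowi 2014'
g_url = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % quote_plus(query)
print(g_url)  # http://www.google.com/search?q=jokowi+2014&num=100&hl=en&start=0
```

Without this step, a query containing spaces or quotes would produce a malformed request URL.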