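I am trying to scrape Google for results from specific date ranges, for example 2002, 2004, and so on. I can't use pygoogle, xgoogle, or Google Search, because none of them lets you specify the time period to search. So I worked out the query URL myself, but when I run my script, Google sends me the same results no matter which results page I request.
Here is my code: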
import time
import urllib2
import re
import random
#Define search term.
agent='PT+e+PMDB'
#Define headers
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
#Counter used as the key for each stored link.
contador=0
#Dictionary where all links will be stored.
Links2002={}
#Number of pages to search through.
NPages=50
#Start routine.
for i in range(1,NPages,1):
    tempUrl2002='https://www.google.com/search?q='+str(agent)+'&hl=pt-BR&biw=1137&bih=1354&sa=X&ei=eR8rU8HTEIqhkQeEuoCICg&ved=0CBoQpwUoBjgU&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2002%2Ccd_max%3A31%2F12%2F2002&tbm=#filter=0&hl=pt-BR&q='+str(agent)+'&start='+str(i*10)+'&tbs=cdr:1,cd_min:01/01/2002,cd_max:31/12/2002'
    #URL used by Request.
    req=urllib2.Request(tempUrl2002,headers=hdr)
    #Search.
    SearchResults=urllib2.urlopen(req)
    #Get search data.
    page=SearchResults.read()
    #Define a random pause.
    wt=random.uniform(10,30)
    #Pause in order to prevent Google from stopping the script.
    time.sleep(wt)
    #Get all links.
    links = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', page)
    #Store the results.
    for url in links:
        contador=contador+1
        Links2002[contador]=url
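One thing that stands out in the URL above is the `#` in the middle: everything after it is a URL fragment, which an HTTP client does not send to the server, so `start` and the second `tbs` never reach Google. Below is a minimal sketch of building the URL with those parameters in the query string instead. It is written for Python 3 (`urllib.parse`) rather than the `urllib2` used above, `build_search_url` is a hypothetical helper name, the search term is passed unencoded, and the shortened `original` string is only an illustration of the URL above. It only guarantees that the parameters are sent; whether Google honors them for scripted requests is an assumption I have not verified.

```python
from urllib.parse import urlencode, urlsplit


def build_search_url(term, year, page, per_page=10):
    """Hypothetical helper: put the date range and page offset in the query string."""
    params = {
        "q": term,          # raw search term; urlencode turns spaces into '+'
        "hl": "pt-BR",
        "filter": "0",
        # Custom date range in the same day/month/year format as the URL above.
        "tbs": "cdr:1,cd_min:01/01/{0},cd_max:31/12/{0}".format(year),
        # Offset of the first result on the requested page.
        "start": str(page * per_page),
    }
    return "https://www.google.com/search?" + urlencode(params)


# Shortened version of the URL from the loop above, for illustration only.
original = "https://www.google.com/search?q=PT+e+PMDB&tbm=#filter=0&q=PT+e+PMDB&start=10"
print(urlsplit(original).query)     # the part that is sent to the server
print(urlsplit(original).fragment)  # the part that stays on the client

url = build_search_url("PT e PMDB", 2002, 3)
print(url)
print(urlsplit(url).fragment == "")  # no fragment left over
```

In the loop, `tempUrl2002` would then be replaced by a call such as `build_search_url('PT e PMDB', 2002, i)`, with the rest of the script unchanged.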
Does anyone know how to do this correctly? Is there a smart way to get Google search results from a specific date range?
Best, Hu