Getting Google search pages from a specific date

Time: 2014-03-24 16:30:32

Tags: python html href urllib2 google-search

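I am trying to scrape Google for specific time periods, such as 2002, 2004, and so on. I cannot use pygoogle, xgoogle, or Google Search, because none of them lets you specify the period you want to search. So I worked out the query myself, but when I run my script, Google sends me the same results no matter which results page I request.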

Here is my code:

import time
import urllib2
import re
import random
#Define search term.
agent='PT+e+PMDB'

#Define headers
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}

#Counter used as the key for each stored link.
contador=0
#Dictionary where all links will be stored.
Links2002={}
#Number of pages to search through.
NPages=50

#Start routine.
for i in range(1,NPages,1):
    tempUrl2002='https://www.google.com/search?q='+str(agent)+'&hl=pt-BR&biw=1137&bih=1354&sa=X&ei=eR8rU8HTEIqhkQeEuoCICg&ved=0CBoQpwUoBjgU&source=lnt&tbs=cdr%3A1%2Ccd_min%3A01%2F01%2F2002%2Ccd_max%3A31%2F12%2F2002&tbm=#filter=0&hl=pt-BR&q='+str(agent)+'&start='+str(i*10)+'&tbs=cdr:1,cd_min:01/01/2002,cd_max:31/12/2002'
    #url used by Request.
    req=urllib2.Request(tempUrl2002,headers=hdr)
    #Search.
    SearchResults=urllib2.urlopen(req)
    #Get search data.
    page=SearchResults.read()
    #Pick a random pause length, in seconds.
    wt=random.uniform(10,30)
    #Pause in order to prevent Google from stopping the script.
    time.sleep(wt)
    #Get all links.
    links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', page)
    #Store the results.
    for url in links:
        contador=contador+1
        Links2002[contador]=url
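
One plausible cause, offered as an assumption rather than a confirmed diagnosis: in the URL built inside the loop, `&start=` and the second `tbs` come after `#filter=0`. Everything after `#` is a URL fragment, which is meant to be handled by the client and not sent to the server, so the page offset may never reach Google. The sketch below builds the query with `urlencode` so that all parameters sit in the query string. `build_search_url` is a hypothetical helper, the parameter names are copied from the URL above, and whether Google honors them today is not verified here.

```python
try:
    from urllib.parse import urlencode, urlsplit  # Python 3
except ImportError:
    from urllib import urlencode                  # Python 2 (alongside urllib2)
    from urlparse import urlsplit


def build_search_url(term, year, page):
    """Hypothetical helper: build one results-page URL with no '#' fragment."""
    params = [
        ('q', term),                # plain text; urlencode turns spaces into '+'
        ('hl', 'pt-BR'),
        ('filter', '0'),
        ('start', str(page * 10)),  # offset of the first result on this page
        # Date range in the day/month/year form used by the original URL.
        ('tbs', 'cdr:1,cd_min:01/01/%d,cd_max:31/12/%d' % (year, year)),
    ]
    return 'https://www.google.com/search?' + urlencode(params)


url = build_search_url('PT e PMDB', 2002, 3)
parts = urlsplit(url)
print(parts.query)           # every parameter, including start, is in the query
print(repr(parts.fragment))  # empty: nothing is hidden behind a '#'
```

The returned string could be passed to `urllib2.Request(url, headers=hdr)` in place of `tempUrl2002`, with the rest of the loop unchanged.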

Does anyone know how to do this correctly? Is there a clever way to get Google search results from a specific date?

Best, Hu

0 answers:

No answers