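For a project I am working on, I need to query Google (or a similar search engine) about 40,000 times. I am only interested in the number of hits each query returns. I wrote a script to do this, but it gets stuck at around the 100th query, where I start receiving 503 errors. Evidently there is a rate limit.
My question: which alternative search engines and Python APIs would let me run an unlimited number of queries?
The following code is what I have tried so far: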
import requests
from bs4 import BeautifulSoup
import time
from random import randint

SEARCH_URL = 'http://www.google.com/search'

def get_count(word1, word2):
    # Search for the exact phrase "word1 word2" (tbs=li:1 requests verbatim results).
    params = {'q': '"' + word1 + ' ' + word2 + '"', 'tbs': 'li:1'}
    time.sleep(randint(5, 15))
    r = requests.get(SEARCH_URL, params=params)
    # A Response is falsy when the status code is 4xx/5xx, e.g. the 503 above.
    while not r:
        print("****** wait ... " + str(r))
        time.sleep(randint(10, 100))
        r = requests.get(SEARCH_URL, params=params)
    if r.ok:
        soup = BeautifulSoup(r.text, "lxml")
        stats = soup.find('div', {'id': 'resultStats'})
        # find() returns None when the div is missing, so guard before reading .text.
        res = stats.text if stats else ''
        if res:
            try:
                return int(res)
            except ValueError:
                print(res.split())
                # Text looks like "About 1,234 results" or "1,234 results".
                if res.startswith('About'):
                    return int(res.split()[1].replace(',', ''))
                else:
                    return int(res.split()[0].replace(',', ''))
        else:
            return 0
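The string handling at the end of `get_count` could be pulled into a small helper so it can be checked without hitting the network. This is only a sketch: `parse_result_count` is a hypothetical name, and it assumes the stats text is in English with comma thousands separators, such as "About 1,234,000 results (0.45 seconds)".

```python
import re

# Hypothetical helper, not part of the script above.
# Assumption: counts use "," as the thousands separator (English locale).
_COUNT_RE = re.compile(r"\d+(?:,\d{3})*")

def parse_result_count(stats_text):
    """Return the first integer found in a result-stats string, or 0 if none."""
    match = _COUNT_RE.search(stats_text)
    if match is None:
        return 0
    return int(match.group(0).replace(",", ""))

if __name__ == "__main__":
    print(parse_result_count("About 1,234,000 results (0.45 seconds)"))
```

Taking the first number in the string avoids branching on whether the text starts with "About", which is the only distinction the original code makes.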
Answer 0 (score: 0)
I don't know how to do this in R, but here is an Excel/VBA solution.
Sub Gethits()
    Dim url As String, lastRow As Long, i As Long
    Dim XMLHTTP As Object, html As Object, objResultDiv As Object, objH3 As Object, link As Object
    Dim start_time As Date
    Dim end_time As Date
    Dim var As String
    Dim var1 As Object
    Dim cookie As String
    Dim result_cookie As String

    ' Search terms are read from column A, starting at row 2.
    lastRow = Range("A" & Rows.Count).End(xlUp).Row
    start_time = Time
    Debug.Print "start_time:" & start_time

    For i = 2 To lastRow
        ' The random "rnd" parameter makes each request URL unique.
        url = "https://www.google.com/search?q=" & Cells(i, 1) & "&rnd=" & WorksheetFunction.RandBetween(1, 10000)
        Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
        XMLHTTP.Open "GET", url, False
        XMLHTTP.setRequestHeader "Content-Type", "text/xml"
        XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
        XMLHTTP.send

        ' Load the response into an HTML document and read the result-stats element.
        Set html = CreateObject("htmlfile")
        html.body.innerHTML = XMLHTTP.ResponseText
        Set objResultDiv = html.getelementbyid("rso")
        Set var1 = html.getelementbyid("resultStats")
        ' Skip the row if the element is missing (for example, on a blocked request).
        If Not var1 Is Nothing Then Cells(i, 2).Value = var1.innerText
        DoEvents
    Next

    end_time = Time
    Debug.Print "end_time:" & end_time
    ' DateDiff with "n" returns the elapsed time in minutes.
    Debug.Print "done" & "Time taken : " & DateDiff("n", start_time, end_time)
    MsgBox "done" & "Time taken : " & DateDiff("n", start_time, end_time)
End Sub
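Put your search terms in column A (starting at row 2), then run the script. The result-stats text for each term is written to column B.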