Alternatives to scraping search engines

Time: 2018-01-25 19:42:54

Tags: python python-3.x search web-scraping beautifulsoup

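For the work I am doing, I need to query Google (or a similar search engine) about 40K times. I am only interested in the number of matches each query returns. I wrote a script to do this, but I get stuck around the 100th query, where I start receiving 503 errors. Evidently there is a limit.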

The question is: which alternative search engines and Python APIs would let me run queries without such a limit?

The following code is what I have tried so far:

import time
from random import randint

import requests
from bs4 import BeautifulSoup

SEARCH_URL = 'http://www.google.com/search'


def get_count(word1, word2):
    """Return the hit count Google reports for the exact phrase "word1 word2"."""
    params = {'q': '"' + word1 + ' ' + word2 + '"',
              'tbs': 'li:1'}

    time.sleep(randint(5, 15))  # random pause between queries
    r = requests.get(SEARCH_URL, params=params)
    # A Response is falsy when its status code is 4xx/5xx (e.g. the 503),
    # so this loop keeps retrying until a request succeeds.
    while not r:
        print("****** wait ... " + str(r))
        time.sleep(randint(10, 100))
        r = requests.get(SEARCH_URL, params=params)

    soup = BeautifulSoup(r.text, "lxml")
    stats = soup.find('div', {'id': 'resultStats'})
    if stats is None:
        return 0  # no result-count element on the page

    # The text looks like "About 1,230 results" or "3 results".
    tokens = stats.text.split()
    if not tokens:
        return 0
    number = tokens[1] if tokens[0] == 'About' else tokens[0]
    return int(number.replace(',', ''))
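
The retry loop in the code above never gives up, so a persistent 503 keeps the script waiting indefinitely. One possible variation is a bounded retry with exponential backoff. The following is only a minimal sketch: `fetch_with_backoff`, its parameters, and the injected `fetch` callable are hypothetical names introduced for illustration, not part of the original script.

```python
import random
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=10.0, sleep=time.sleep):
    """Call fetch() until it returns a truthy response or retries run out.

    fetch is any zero-argument callable (an assumption of this sketch), for
    example: lambda: requests.get(url, params=params). A requests.Response
    is falsy for 4xx/5xx status codes, so a 503 triggers another attempt.
    """
    for attempt in range(max_retries + 1):
        response = fetch()
        if response:
            return response
        if attempt == max_retries:
            break
        # Double the wait on each failure and add jitter so retries do not align.
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        sleep(delay)
    return None  # the caller decides what to do after giving up
```

Backing off does not lift any limit the search engine imposes; it only bounds how long the script waits before reporting a failure for that query.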

1 Answer:

Answer 0 (score: 0)

I don't know how to do this in R, but here is an Excel/VBA solution.

Sub Gethits()
    Dim url As String, lastRow As Long, i As Long
    Dim XMLHTTP As Object, html As Object, var1 As Object
    Dim start_time As Date
    Dim end_time As Date

    ' Search terms are in column A, starting at row 2.
    lastRow = Range("A" & Rows.Count).End(xlUp).Row

    start_time = Time
    Debug.Print "start_time:" & start_time

    For i = 2 To lastRow

        ' rnd is a random value appended to make each request URL unique.
        url = "https://www.google.com/search?q=" & Cells(i, 1) & "&rnd=" & WorksheetFunction.RandBetween(1, 10000)

        Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
        XMLHTTP.Open "GET", url, False
        XMLHTTP.setRequestHeader "Content-Type", "text/xml"
        XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
        XMLHTTP.send

        ' Parse the response and write the result-count text to column B.
        Set html = CreateObject("htmlfile")
        html.body.innerHTML = XMLHTTP.ResponseText
        Set var1 = html.getelementbyid("resultStats")
        If var1 Is Nothing Then
            ' No result-count element was returned (for example, a blocked request).
            Cells(i, 2).Value = ""
        Else
            Cells(i, 2).Value = var1.innerText
        End If

        DoEvents
    Next

    end_time = Time
    Debug.Print "end_time:" & end_time

    ' DateDiff with "n" reports the elapsed time in minutes.
    Debug.Print "done" & "Time taken : " & DateDiff("n", start_time, end_time)
    MsgBox "done" & "Time taken : " & DateDiff("n", start_time, end_time)
End Sub

Put your search terms in column A, then run the script.
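
Since the question is tagged Python, the same terms-in-column-A, counts-in-column-B flow could be sketched with the standard `csv` module. This is a rough sketch under stated assumptions: `fill_counts`, the `hits` header, and the injected `count_fn` are hypothetical names, and `count_fn` stands in for whatever function actually performs the query (for instance, one that returns the parsed result count).

```python
import csv
import io


def fill_counts(in_file, out_file, count_fn):
    """Write each search term from the first column plus count_fn(term)."""
    reader = csv.reader(in_file)
    writer = csv.writer(out_file, lineterminator="\n")
    # The VBA loop starts at row 2, so row 1 is treated as a header here.
    header = next(reader, None)
    if header is not None:
        writer.writerow(header[:1] + ["hits"])  # "hits" is a made-up column name
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip blank rows
        term = row[0]
        writer.writerow([term, count_fn(term)])


if __name__ == "__main__":
    # Demo with in-memory files and len() as a stand-in for a real query.
    source = io.StringIO("term\nfoo bar\nbaz\n")
    target = io.StringIO()
    fill_counts(source, target, len)
    print(target.getvalue())
```

Passing the query function in as an argument keeps the file handling separate from the network code, so the loop can be checked without sending any requests.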
