How do I detect a CAPTCHA when scraping Google?

Date: 2016-09-10 03:51:13

Tags: beautifulsoup python-requests screen-scraping captcha google-search

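I'm using the requests package together with BeautifulSoup to scrape Google News for the number of search results a query returns. I'm getting two kinds of IndexError that I'd like to tell apart: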

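  1. The number of search results is empty. In this case, selecting #resultStats returns an empty list, []. What seems to happen is that when the query string is too long, Google doesn't even say "0 search results"; it just says nothing at all.
  2. Google serves me a CAPTCHA.

I need to distinguish these cases because when Google sends me a CAPTCHA I want my scraper to wait five minutes, but not when the result string is merely empty.

I currently have a jury-rigged approach: I send a second query that is known to have a nonzero number of search results, which lets me tell the two IndexErrors apart. I'm wondering whether there is a more elegant and direct way to do this with BeautifulSoup.

Here is my code: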

    import requests, bs4, lxml, re, time, random
    import pandas as pd
    import numpy as np
    
    URL = 'https://www.google.com/search?tbm=nws&q={query}&tbs=cdr%3A1%2Ccd_min%3A{year}%2Ccd_max%3A{year}&authuser=0'
    headers = {
        "User-Agent":
            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    }
    
    def tester(): # test for captcha
        test = requests.get('https://www.google.ca/search?q=donald+trump&safe=off&client=ubuntu&espv=2&biw=1910&bih=969&source=lnt&tbs=cdr%3A1%2Ccd_min%3A2016%2Ccd_max%3A&tbm=nws', headers=headers)
        dump = bs4.BeautifulSoup(test.text,"lxml")
        result = dump.select('#resultStats')
        num = result[0].getText()
        num = re.search(r"\b\d[\d,.]*\b",num).group() # regex
        num = int(num.replace(',',''))
        num = (num > 0)
        return num
    
    def search(**params):
        response = requests.get(URL.format(**params),headers=headers)
        print(response.content, response.status_code) # check this for google requiring Captcha
        soup = bs4.BeautifulSoup(response.text,"lxml")
        elems = soup.select('#resultStats')
    
        try: # want code to flag if I get a Captcha
            hits = elems[0].getText()
            hits = re.search(r"\b\d[\d,.]*\b",hits).group() # regex
            hits = int(hits.replace(',',''))
            print(hits)    
            return hits
        except IndexError:
            try:
                tester() > 0 # if captcha, this will throw up another IndexError
                print("Empty results!")
                hits = 0
                return hits
            except IndexError:
                print("Captcha'd!")
                time.sleep(120) # should make it rotate IP when captcha'd
                hits = 0
                return hits
    
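    # `list` is the built-in type and cannot be iterated; use a named list of queries.
    queries = ['donald trump']  # placeholder; the original post did not show its query list
    for qry in queries:
        hits = search(query=qry, year=2016)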
    

1 answer:

Answer 0 (score: 2)

I would simply search the page for "captcha". For example, if this is Google reCAPTCHA, you can look for the hidden input that contains the token:

    is_captcha_on_page = soup.find("input", id="recaptcha-token") is not None
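
To show how that check could replace the second probe request, here is a minimal sketch that classifies a single response as a CAPTCHA page, an empty result, or a normal result page. The function name classify_page and the three labels are hypothetical. The id pattern "captcha" and the 429/503 status codes are assumptions about what a block page may look like, not confirmed behaviour; only the recaptcha-token input and #resultStats come from this thread.

```python
import re

import bs4

# Hypothetical labels for the three outcomes the question wants to separate.
CAPTCHA, EMPTY, RESULTS = "captcha", "empty", "results"


def classify_page(html, status_code=200):
    """Classify one response without sending a second probe request."""
    soup = bs4.BeautifulSoup(html, "html.parser")

    # Signal 1: the hidden reCAPTCHA token input suggested in the answer.
    if soup.find("input", id="recaptcha-token") is not None:
        return CAPTCHA
    # Signal 2 (assumption): any element whose id mentions "captcha".
    if soup.find(id=re.compile("captcha", re.I)) is not None:
        return CAPTCHA
    # Signal 3 (assumption): block pages may arrive with HTTP 429 or 503.
    if status_code in (429, 503):
        return CAPTCHA

    # No CAPTCHA markers: the page either has a #resultStats element or not.
    if soup.select("#resultStats"):
        return RESULTS
    return EMPTY
```

Inside search() this could be called as classify_page(response.text, response.status_code), sleeping only when it returns CAPTCHA and returning 0 hits straight away when it returns EMPTY. Google's markup changes over time, so the markers should be checked against a real blocked response before relying on them.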