Question

I'm trying to scrape Google search results. This code works on every other website I've tried, but not on Google: it returns an empty list.

from BeautifulSoup import BeautifulSoup
import requests

def googlecrawler(search_term):
    url="https://www.google.co.in/?gfe_rd=cr&ei=UVSeVZazLozC8gfU3oD4DQ&gws_rd=ssl#q="+search_term
    junk_code=requests.get(url)
    ok_code=junk_code.text
    good_code=BeautifulSoup(ok_code)
    best_code=good_code.findAll('h3',{'class':'r'})
    print best_code


googlecrawler("healthkart")

It should return something like this:

<h3 class="r"><a href="/url?  sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=6&amp;cad=rja&amp;uact=8&amp;ved=0CEIQFjAF&amp;url=http%3A%2F%2Fwww.coupondunia.in%2Fhealthkart&amp;ei=qFmfVc2fFNO0uASti4PwDQ&amp;usg=AFQjCNFHMzqn-rH4Hp-fZK0E4wwxJmevEg&amp;sig2=QgwxMBdbPndyQTSH10dV2Q" onmousedown="return rwt(this,'','','','6','AFQjCNFHMzqn-rH4Hp-fZK0E4wwxJmevEg','QgwxMBdbPndyQTSH10dV2Q','0CEIQFjAF','','',event)" data-href="http://www.coupondunia.in/healthkart">HealthKart Coupons: July 2015 Coupon Codes</a></h3>

Answer 1

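Looking at good_code, I don't see any h3 or class "r" in it at all. That is why your code returns an empty list.

There is nothing wrong with your code as such; the elements you are searching for simply aren't in the page you received.

What did you expect to get back?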
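
One likely reason the markup is missing, offered as an assumption rather than a confirmed diagnosis: the question's URL puts the search term after `#q=`, which makes it a URL fragment. Fragments are handled by the browser and are not sent to the server, so the request only asks for the Google home page. The sketch below uses only the standard library to show which part of a URL is actually requested; `request_target` is a hypothetical helper, and the URL is a shortened version of the one in the question.

```python
from urllib.parse import urlencode, urlsplit

def request_target(url):
    """Return the path and query that an HTTP client sends for this URL."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target

# Shortened form of the question's URL: the search term sits in the fragment.
original = "https://www.google.co.in/?gfe_rd=cr&gws_rd=ssl#q=healthkart"
print(request_target(original))      # the search term is not part of the request
print(urlsplit(original).fragment)   # it only lives in the fragment

# Moving the term into the query string makes it part of the request.
fixed = "https://www.google.co.in/search?" + urlencode({"q": "healthkart"})
print(request_target(fixed))
```

Even with the term in the query string, the server may still return different markup than a browser shows, so inspecting the fetched HTML remains the first step.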

Answer 2

A common solution is to add a user-agent (passed in the request headers) so that the request looks like a visit from a real user:

# https://www.whatismybrowser.com/guides/the-latest-user-agent/
headers = {
  "User-Agent":
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
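
As a quick sanity check, the sketch below builds a request object and confirms the custom header is attached, without sending anything. It uses the standard library's `urllib.request` purely so it runs without third-party packages or network access (the answer itself uses `requests`), and the `example.com` URL is a placeholder. For context, `requests` identifies itself by default as `python-requests/<version>`, which a site can easily tell apart from a browser.

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Build the request object only; nothing is sent over the network here.
req = Request("https://example.com/search?q=test", headers=headers)

# urllib stores header names capitalized, so the key is "User-agent".
sent_user_agent = req.get_header("User-agent")
print(sent_user_agent)
```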

So your code would look like this:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

def googlecrawler(search_term):
    # params lets requests URL-encode the search term into ?q=...
    html = requests.get('https://www.google.com/search', params={'q': search_term}, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')

    for container in soup.findAll('div', class_='tF2Cxc'):
        title = container.select_one('.DKV0Md').text
        link = container.find('a')['href']
        print(f'{title}\n{link}')

googlecrawler('site:Facebook.com Dentist gmail.com')

# part of the output:
'''
COVID-19 Office Update Dear... - Canton Dental Associates ...
https://www.facebook.com/permalink.php?id=107567882605441&story_fbid=3205134459515419

Spinelli Dental - General Dentist - Rochester, New York ...
https://www.facebook.com/spinellidental/about/?referrer=services_landing_page

LaboSmile USA Dentist & Dental Office in Delray ... - Facebook
https://www.facebook.com/labosmileusa/
'''

Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "engine": "google",
  "q": "site:Facebook.com Dentist gmail.com",
  "api_key": os.getenv('API_KEY')
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  title = result['title']
  link = result['link']
  print(f'{title}\n{link}\n')

# part of the output:
'''
Green Valley Dental - About | Facebook
https://www.facebook.com/GVDFamily/about/

My Rivertown Dentist - About | Facebook
https://www.facebook.com/Rivertownfamily/about/

COVID-19 Office Update Dear... - Canton Dental Associates ...
https://www.facebook.com/permalink.php?id=107567882605441&story_fbid=3205134459515419
'''
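
The loop above assumes the response always contains an `organic_results` key with `title` and `link` on every entry. A more defensive variant is sketched below; `extract_links` is a hypothetical helper, and the sample dictionary is made up to mirror the shape of the output shown here, not real API data.

```python
def extract_links(results):
    """Return (title, link) pairs, tolerating a missing 'organic_results' key."""
    pairs = []
    for result in results.get("organic_results", []):
        title = result.get("title")
        link = result.get("link")
        # Skip entries that lack either field instead of raising KeyError.
        if title and link:
            pairs.append((title, link))
    return pairs

# Hypothetical response shaped like the output above; not real API data.
sample = {
    "organic_results": [
        {"title": "Green Valley Dental - About | Facebook",
         "link": "https://www.facebook.com/GVDFamily/about/"},
        {"title": "Entry without a link"},
    ]
}

print(extract_links(sample))
print(extract_links({}))  # a response with no organic results yields an empty list
```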

Disclaimer: I work for SerpApi.

BeautifulSoup can't scrape Google search results?