Question

I'm building a simple web crawler and want it to scrape the result pages of a Google search query such as "Donald Trump". I've written the following code:

# import requests
from urllib.request import urlopen as uReq
import urllib.request
from bs4 import BeautifulSoup as soup

paging_url = "https://www.google.gr/search?ei=fvtMW8KMI4vdwQLS67yICA&q=donald+trump&oq=donald+trump&gs_l=psy-ab.3..35i39k1j0i131k1j0i203k1j0j0i203k1j0l3j0i203k1l2.4578.6491.0.6763.12.9.0.0.0.0.447.879.4-2.2.0....0...1c.1.64.psy-ab..10.2.878....0.aB3Y8R5B0U8"

req = urllib.request.Request(paging_url, headers={'User-Agent': "Magic Browser"})

UClient = uReq(req)  # downloading the url
page_html = UClient.read()
UClient.close()

page_soup = soup(page_html, "html.parser")
results = page_soup.findAll("div", {"class": "srg"})
print(len(results))

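To explain my thinking and what I've noticed about the structure of the Google page:

I'm trying to get only the search results, not the recommended videos or images that Google also shows. When recommended videos or images are shown, the nine results sit under two "div" tags with the class "srg", and another "div" tag containing the recommended videos/images is inserted between them.

My problem is that my code can't "see" the "div" tags that belong to the "srg" class. I don't know why BeautifulSoup ignores them. The same thing happens with the "div" tags that belong to the "rc" class. Does anyone have an idea why this happens?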

Answer 1

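I ran into some problems when building web crawlers with PhantomJS to extract Google search data. Sometimes I could browse a few pages and then the crawler would get lost. In some cases the generated page source told me that I appeared to be doing something illegal and should use the paid "Custom Search JSON API" instead. The solution I found was to build the crawler against the Yahoo site, and the results were satisfactory for me.

The Google API allows 100 free searches per day. Depending on the purpose of your application, this may be a less troublesome solution.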
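For reference, here is a minimal sketch of calling the Custom Search JSON API mentioned above using only the standard library. It assumes the documented endpoint with the key, cx, and q query parameters. "API_KEY" and "ENGINE_ID" are placeholders you would have to supply, and the helper names (build_search_url, extract_results, search) are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, query):
    """Return the request URL for one Custom Search JSON API query."""
    return API_URL + "?" + urlencode({"key": api_key, "cx": engine_id, "q": query})


def extract_results(response):
    """Pull (title, link, snippet) tuples out of a decoded JSON response."""
    # "items" is absent when the query has no results, so default to [].
    return [
        (item.get("title"), item.get("link"), item.get("snippet"))
        for item in response.get("items", [])
    ]


def search(api_key, engine_id, query):
    """Perform the request (needs network access and valid credentials)."""
    with urlopen(build_search_url(api_key, engine_id, query)) as resp:
        return extract_results(json.load(resp))


# Usage (placeholders, not real credentials):
# for title, link, snippet in search("API_KEY", "ENGINE_ID", "Donald Trump"):
#     print(f"{title}\n{link}\n{snippet}\n")
```

Keeping URL construction and response parsing in separate functions makes both easy to check without spending any of the daily quota.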

Answer 2

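To get only the organic search results, you can use the SelectorGadget Chrome extension to pick CSS selectors visually, then pass them to the bs4 methods select() (which returns a list you can iterate over) or select_one() (which returns only a single element).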

for result in soup.select('CSS_SELECTOR'):
    ...  # process each matching element

soup.select_one('CSS_SELECTOR')
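A small self-contained sketch of how the two methods behave, run against made-up markup (the "result" class and the example.com links are invented for illustration and are not Google's real page structure). It also shows what happens when a selector matches nothing, which is probably the symptom in the question if the "srg" class is not present in the HTML that was actually downloaded.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; not Google's real page structure.
html = """
<div class="result"><a href="https://example.com/a">First</a></div>
<div class="result"><a href="https://example.com/b">Second</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every match, in document order.
links = [div.a["href"] for div in soup.select("div.result")]

# select_one() returns only the first match.
first_title = soup.select_one("div.result a").text

# A selector that matches nothing gives [] and None, not an error.
missing_all = soup.select("div.srg")
missing_one = soup.select_one("div.srg")
```

Because select_one() can return None, it is worth checking the result before accessing .text or an attribute on it.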

Code (with an example in the online IDE) to scrape the title, link, displayed link, and snippet:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'Donald Trump'}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf').a['href']
  displayed_link = result.select_one('.TbwUpd.NJjxre').text
  snippet = result.select_one('.VwiC3b.yXK7lf.MUxGbd.yDYNvb.lyLwlc').text
  
  print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')

# part of the output:
'''
Donald Trump | TheHill
https://thehill.com/people/donald-trump
https://thehill.com › people › donald-trump
12 hours ago — Donald Trump. Donald Trump. Getty Images. 0 Tweet Share More. Occupation: President of the United States, 2017 - 2021. Political Affiliation: Republican.
'''

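Alternatively, you can do the same thing with the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Essentially, the difference is that you don't have to figure out how to scrape things or bypass blocks, since that has already been done for the end user. Check out the playground.

Code to integrate: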

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google",
  "q": "Donald Trump",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  title = result['title']
  link = result['link']
  displayed_link = result['displayed_link']
  snippet = result['snippet']
  print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')

# part of the output:
'''
Donald Trump - Wikipedia
https://en.wikipedia.org/wiki/Donald_Trump
https://en.wikipedia.org › wiki › Donald_Trump
Donald John Trump (born June 14, 1946) is an American media personality and businessman who served as the 45th president of the United States from 2017 ...
'''
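If some results turn out to lack one of these fields, indexing with result['snippet'] raises a KeyError. The sketch below is one way to tolerate that, assuming the response is a plain dictionary shaped like the one used above. The format_results helper and the sample data are hypothetical and do not come from the SerpApi library.

```python
def format_results(results):
    """Format each organic result, tolerating missing fields."""
    lines = []
    for result in results.get("organic_results", []):
        fields = (
            result.get("title", ""),
            result.get("link", ""),
            result.get("displayed_link", ""),
            result.get("snippet", ""),  # assumed to be absent for some results
        )
        lines.append("\n".join(fields))
    return lines


# Hypothetical data shaped like the dictionary used above (no snippet key).
sample = {
    "organic_results": [
        {"title": "Example", "link": "https://example.com",
         "displayed_link": "https://example.com"},
    ]
}
print(format_results(sample)[0])
```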


Disclaimer: I work for SerpApi.

Web crawler cannot retrieve results from Google search

2 answers: