I am building a simple web crawler and want it to scrape the result pages of a Google search query such as "Donald Trump". I have written the following code:
# import requests
from urllib.request import urlopen as uReq
import urllib.request
from bs4 import BeautifulSoup as soup
# The URL must stay on a single line; a line break inside the string literal is a syntax error
paging_url = "https://www.google.gr/search?ei=fvtMW8KMI4vdwQLS67yICA&q=donald+trump&oq=donald+trump&gs_l=psy-ab.3..35i39k1j0i131k1j0i203k1j0j0i203k1j0l3j0i203k1l2.4578.6491.0.6763.12.9.0.0.0.0.447.879.4-2.2.0....0...1c.1.64.psy-ab..10.2.878....0.aB3Y8R5B0U8"
# Send a custom User-Agent header with the request
req = urllib.request.Request(paging_url, headers={'User-Agent': "Magic Browser"})
UClient = uReq(req) # downloading the url
page_html = UClient.read()
UClient.close()
page_soup = soup(page_html, "html.parser")
results = page_soup.findAll("div", {"class": "srg"})
print(len(results))
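To explain my approach and what I noticed about the structure of Google's page:
I am trying to get only the search results, not the suggested videos or images that Google also shows. When suggested videos or images are shown, the nine results sit under two "div" tags with the class "srg", and another "div" tag containing the suggested videos/images is inserted between them.
My problem is that my code cannot "see" the "div" tags with the class "srg". I don't know why BeautifulSoup ignores them. The same happens with the "div" tags of class "rc". Does anyone have an idea why this happens?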
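Answer 0 (score: 0)
I ran into some problems when using PhantomJS to build web crawlers that extract Google search data. Sometimes I could get through a few pages and then the crawler would lose its way. In some cases the returned page told me that I appeared to be doing something not permitted and that I should use the paid "Custom Search JSON API" instead. The solution I found was to build the crawler against the Yahoo site, and for my purposes the results were satisfactory.
The Google API allows 100 free searches per day. Depending on the purpose of your application, this may be the less troublesome solution.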
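As a rough illustration of the API route mentioned above, here is a minimal sketch that uses only the standard library. The endpoint and the `key`, `cx`, and `q` parameters are those of the Custom Search JSON API; `build_search_url` and `fetch_results` are hypothetical helper names, and the credentials are placeholders you would replace with your own API key and search engine ID.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(api_key, engine_id, query):
    """Return the request URL for one Custom Search JSON API query."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return f"{API_ENDPOINT}?{urlencode(params)}"


def fetch_results(api_key, engine_id, query):
    """Call the API and return (title, link) pairs from the JSON response."""
    with urlopen(build_search_url(api_key, engine_id, query)) as response:
        data = json.load(response)
    # "items" is absent when the query has no results.
    return [(item["title"], item["link"]) for item in data.get("items", [])]


if __name__ == "__main__":
    # Placeholder credentials (assumptions): substitute real values before calling fetch_results().
    print(build_search_url("YOUR_API_KEY", "YOUR_ENGINE_ID", "Donald Trump"))
```

Only `build_search_url` runs without credentials; `fetch_results` performs a real network request and counts against the daily quota.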
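Answer 1 (score: 0)
To get only the search results, you can use the SelectorGadget Chrome extension to pick CSS selectors visually, and then pass them to the bs4 methods select() (returns a list you can iterate over) or select_one() (returns only one element).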
for result in soup.select('CSS_SELECTOR'):
    ...
soup.select_one('CSS_SELECTOR')
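To make the difference between the two methods concrete, here is a small self-contained sketch. The markup and the class names `result` and `title` are made up for illustration and do not correspond to Google's page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a results page; class names are made up.
html = """
<div class="result"><h3 class="title">First</h3><a href="https://example.com/1">link</a></div>
<div class="result"><h3 class="title">Second</h3><a href="https://example.com/2">link</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every match, so it can be iterated.
titles = [result.select_one(".title").text for result in soup.select(".result")]

# select_one() returns only the first match, or None when nothing matches.
first_link = soup.select_one(".result a")["href"]
missing = soup.select_one(".does-not-exist")

print(titles, first_link, missing)
```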
Code that scrapes the title, link, displayed link, and snippet (an example is available in the online IDE):
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'Donald Trump'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')
# container with all needed data
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf').a['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text
    snippet = result.select_one('.VwiC3b.yXK7lf.MUxGbd.yDYNvb.lyLwlc').text
    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
# part of the output:
'''
Donald Trump | TheHill
https://thehill.com/people/donald-trump
https://thehill.com › people › donald-trump
12 hours ago — Donald Trump. Donald Trump. Getty Images. 0 Tweet Share More. Occupation: President of the United States, 2017 - 2021. Political Affiliation: Republican.
'''
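One caveat with the loop above: select_one() returns None when a selector matches nothing, so accessing `.text` on it raises AttributeError. Since class names like these tend to change over time, a defensive variant may be useful. The sketch below is only an illustration on made-up sample markup; `text_or_none` is a hypothetical helper, not part of bs4.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


# Hypothetical sample: the second result has no snippet element.
sample_html = """
<div class="tF2Cxc"><h3 class="DKV0Md">Title A</h3><div class="VwiC3b">Snippet A</div></div>
<div class="tF2Cxc"><h3 class="DKV0Md">Title B</h3></div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
rows = [
    (text_or_none(result, ".DKV0Md"), text_or_none(result, ".VwiC3b"))
    for result in sample_soup.select(".tF2Cxc")
]
print(rows)
```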
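Alternatively, you can do the same thing with the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches.
Essentially, the difference is that you don't have to figure out how to scrape things or how to get around blocking, because that has already been done for the end user. Check out the playground.
Code to integrate: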
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "Donald Trump",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    snippet = result['snippet']
    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
# part of the output:
'''
Donald Trump - Wikipedia
https://en.wikipedia.org/wiki/Donald_Trump
https://en.wikipedia.org › wiki › Donald_Trump
Donald John Trump (born June 14, 1946) is an American media personality and businessman who served as the 45th president of the United States from 2017 ...
'''
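If some results lack one of these fields, indexing with `result['snippet']` raises KeyError. A hedged sketch of a more tolerant loop follows; it runs on a hypothetical dictionary shaped like the one iterated above (it is not real API output), and `summarize` is an illustrative helper name.

```python
# Hypothetical response shaped like the dictionary returned by search.get_dict().
sample_results = {
    "organic_results": [
        {"title": "Donald Trump - Wikipedia",
         "link": "https://en.wikipedia.org/wiki/Donald_Trump",
         "snippet": "Donald John Trump (born June 14, 1946) ..."},
        {"title": "Donald Trump | TheHill",
         "link": "https://thehill.com/people/donald-trump"},  # no snippet
    ]
}


def summarize(results):
    """Return (title, link, snippet) tuples, with None for absent fields."""
    return [
        (item.get("title"), item.get("link"), item.get("snippet"))
        for item in results.get("organic_results", [])
    ]


summary = summarize(sample_results)
for title, link, snippet in summary:
    print(title, link, snippet, sep="\n", end="\n\n")
```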
Disclaimer: I work for SerpApi.