Web scraping Google Ads with Python and BeautifulSoup

Time: 2019-02-21 00:54:08

Tags: python web-scraping beautifulsoup

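I am trying to scrape the Google search results that are marked "Ad" on the right-hand side, that is, the Google ad links in the search results. I have the script below and I am stuck at the soup.select() step: I am not sure which selector to use. Thanks in advance for your help. Inspected elements for reference: [image: screen capture of inspect element]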

#!/usr/bin/env python3

import requests, bs4, webbrowser

#Get Google search results
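ui_search = input("Search google: ")
if len(ui_search) <= 1:
    raise SystemExit('Please enter a search term.')
print('Googling...')  # display text while downloading
# Pass the query as a parameter so that requests URL-encodes it;
# ' '.join(ui_search) would put a space between every character.
res = requests.get('https://google.com/search', params={'q': ui_search})
res.raise_for_status()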

#Retrieve the results with ads and open them.
soup = bs4.BeautifulSoup(res.text, 'html.parser')

#Open a browser tab for each result
linkElems = soup.select('.V0MxL a')
linkElems2 = soup.select('.ad_cclk a')
numOpen = min(5, len(linkElems))
print(numOpen)
for i in range(numOpen):
    print(linkElems[i].get('href'))
    webbrowser.open('http://google.com' +linkElems[i].get('href'))

Similar code that works, but does not target ads specifically:

#! python3
#lucky.py - Opens several Google search results.

import requests
import sys
import webbrowser
import bs4

# Use the command-line arguments if given, otherwise prompt for a query
if len(sys.argv) > 1:
    query = ' '.join(sys.argv[1:])
else:
    query = input("Search google: ")
print('Googling...')  # display text while downloading
# params lets requests URL-encode the query
res = requests.get('http://google.com/search', params={'q': query})
res.raise_for_status()

#Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
#type(soup)
#Open a browser tab for each result
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
    print(linkElems[i])
    # webbrowser.open('http://google.com' + linkElems[i].get('href'))

Example results: [image: screenshot of example results]
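
A side note on how the search URL is built in the scripts above: calling ' '.join() on a plain string iterates over its characters, so the query ends up with a space between every letter. A minimal sketch of one way to build the URL with the standard library instead; the helper name build_search_url is hypothetical and not part of the original scripts:

```python
from urllib.parse import urlencode


def build_search_url(query, base="https://www.google.com/search"):
    """Return a search URL with the query safely URL-encoded."""
    encoded = urlencode({"q": query})  # spaces become '+', special characters are escaped
    return base + "?" + encoded


# Joining a string (rather than a list of words) separates every character.
print(" ".join("gpu"))

print(build_search_url("graphic card buy"))
```

Passing params={'q': query} to requests.get() should have the same effect, since requests encodes the parameters itself.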

1 answer:

Answer 0 (score: 0)

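For this specific case I would rather use the findAll()/find_all() method, because it lets me be more specific and tell bs4 to grab the <div> with a specific class that contains the <a> tag from which I can take the ad link URL.

This only works if Google actually shows those ads at the moment the script runs.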
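
To illustrate the idea without hitting Google, here is a minimal offline sketch. The HTML snippet is made up for the example and only imitates the structure the answer targets; the helper name extract_ad_hrefs is hypothetical, and the class names Google serves change over time, so treat 'pla-unit-title' as a placeholder taken from the answer's code. It also shows that the result is simply an empty list when no ads are present:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; real Google pages differ and change frequently.
SAMPLE_HTML = """
<div class="RnJeZd top pla-unit-title"><a href="/aclk?sa=l&amp;ai=one">Card A</a></div>
<div class="RnJeZd top pla-unit-title"><a href="/aclk?sa=l&amp;ai=two">Card B</a></div>
<div class="r"><a href="/url?q=https://example.com">Organic result</a></div>
"""


def extract_ad_hrefs(html, css_class="pla-unit-title"):
    """Collect the href of the first <a> inside every <div> carrying css_class."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    # A single class name matches any element whose class list contains it.
    for div in soup.find_all("div", class_=css_class):
        anchor = div.find("a", href=True)
        if anchor is not None:
            hrefs.append(anchor["href"])
    return hrefs


print(extract_ad_hrefs(SAMPLE_HTML))
print(extract_ad_hrefs("<html><body>No ads today</body></html>"))
```

Checking for an empty list before using the results avoids an exception on the runs where Google serves no ads.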

Code and full example:

from bs4 import BeautifulSoup
import requests
# The 'lxml' parser used below must be installed (pip install lxml); it does not need to be imported.

headers = {
  "User-Agent":
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=graphic+card+buy&oq=graphic+card+buy&hl=en&gl=us&sourceid=chrome&ie=UTF-8', headers=headers).text

soup = BeautifulSoup(html, 'lxml')

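# Each shopping ad title sits in a <div> with this exact class string;
# the <a> inside it holds the relative ad link.
for link in soup.find_all('div', class_='RnJeZd top pla-unit-title'):
  ad_link = link.a['href']
  print(f'https://www.googleadservices.com/pagead{ad_link}')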
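
The loop above glues a fixed prefix onto each href rather than using urljoin(). A small sketch of why that matters, assuming the scraped hrefs are root-relative paths such as "/aclk?..." (this is inferred from the printed output, not confirmed by the answer); the helper name to_ad_url is hypothetical:

```python
from urllib.parse import urljoin

AD_PREFIX = "https://www.googleadservices.com/pagead"


def to_ad_url(href):
    """Turn a scraped href into a full ad URL; absolute URLs are returned unchanged."""
    if href.startswith(("http://", "https://")):
        return href
    return AD_PREFIX + "/" + href.lstrip("/")


href = "/aclk?sa=l&ctype=5"
print(to_ad_url(href))

# urljoin() treats a leading "/" as "start from the site root",
# so the "/pagead" segment would be dropped here.
print(urljoin(AD_PREFIX + "/", href))
```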

Alternatively, you can use the Google Ad Results API from SerpApi. It is a paid API with a free trial. Check out the Playground to try it.

Output of the BeautifulSoup script above:

https://www.googleadservices.com/pagead/aclk?sa=l&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABAFGgJxdQ&sig=AOD64_39ASmacGcHYwy9gGKmKFRuPLiOQg&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCED0&adurl=
https://www.googleadservices.com/pagead/aclk?sa=l&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABADGgJxdQ&sig=AOD64_2rqOA3PxFKKsigRh1yy3z5QKbtcw&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCEEk&adurl=
https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABAEGgJxdQ&sig=AOD64_0WuY3UDlgTziPk9nUw0f8s3zW3nA&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCEFU&adurl=

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "graphic card buy",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for ads in results["shopping_results"]:
   print(f"Ad link: {ads['link']}")
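
Since ads are not always returned, indexing results["shopping_results"] directly may raise a KeyError on some runs. A defensive sketch on a made-up dictionary; only the "shopping_results" and "link" keys come from the code above, the rest of the sample data and the helper name extract_shopping_links are hypothetical:

```python
def extract_shopping_links(results):
    """Return every 'link' found under 'shopping_results', or an empty list."""
    links = []
    for item in results.get("shopping_results", []):
        link = item.get("link")
        if link:  # skip entries that have no link
            links.append(link)
    return links


# Hypothetical response shape, for illustration only.
sample_results = {
    "shopping_results": [
        {"title": "Card A", "link": "https://example.com/a"},
        {"title": "Card B"},
    ]
}

for ad_link in extract_shopping_links(sample_results):
    print(f"Ad link: {ad_link}")

print(extract_shopping_links({}))
```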
Disclaimer: I work for SerpApi.