我正在尝试抓取右边带有“广告”的Google搜索结果,即抓取搜索结果中的Google广告链接。 我有以下脚本,我被困在汤.select()步骤中。我不确定要使用哪个选择器...提前感谢您的帮助 检查以下元素: screen capture of inspect element
#! python3
#!usr/bin/env python3
import requests, bs4, webbrowser
#Get Google search results
ui_search = input("Search google: ")
print('Googling...') #display text while downloading
if len(ui_search)>1:
res = requests.get('https://google.com/search?q=' + ' '.join(ui_search))
res.raise_for_status()
#Retrieve the results with ads and open them.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
#Open a browser tab for each result
linkElems = soup.select('.V0MxL a')
linkElems2 = soup.select('.ad_cclk a')
numOpen = min(5, len(linkElems))
print(numOpen)
for i in range(numOpen):
print(linkElems[i].get('href'))
webbrowser.open('http://google.com' +linkElems[i].get('href'))
类似代码的代码,但未指定广告:
#! python3
#lucky.py - Opens several Google search results.
import requests
import sys
import webbrowser
import bs4
ui_search = input("Search google: ")
print('Googling...') #display text while downloading
if len(sys.argv) > 1:
res = requests.get('http://google.com/search?q=' + ' '.join(sys.argv[1:]))
elif len(ui_search) > 1:
res = requests.get('http://google.com/search?q=' + ' '.join(ui_search))
res.raise_for_status()
#Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')
#type(soup)
#Open a browser tab for each result
linkElems = soup.select('.r a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
print(linkElems[i])
# webbrowser.open('http://google.com' + linkElems[i].get('href'))
Example results:
答案 0 :(得分:0)
对于这种特定情况,我宁愿使用 findAll()
/find_all()
方法,因为这样我可以获得更具体的信息并告诉 bs4
选择包含特定的 tag
里面,我可以在其中获取广告链接 URL。
只有当 Google 在脚本运行时显示这些广告时,这才有效。
代码和full example:
class
输出:
from bs4 import BeautifulSoup
import requests
import lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=graphic+card+buy&oq=graphic+card+buy&hl=en&gl=us&sourceid=chrome&ie=UTF-8', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
for link in soup.findAll('div', class_='RnJeZd top pla-unit-title'):
ad_link = link.a['href']
print(f'https://www.googleadservices.com/pagead{ad_link}')
或者,您可以使用来自 SerpApi 的 Google Ad Results API。这是一个免费试用的付费 API。查看 Playground 来玩玩。
要集成的代码:
https://www.googleadservices.com/pagead/aclk?sa=l&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABAFGgJxdQ&sig=AOD64_39ASmacGcHYwy9gGKmKFRuPLiOQg&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCED0&adurl=
https://www.googleadservices.com/pagead/aclk?sa=l&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABADGgJxdQ&sig=AOD64_2rqOA3PxFKKsigRh1yy3z5QKbtcw&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCEEk&adurl=
https://www.googleadservices.com/pagead/aclk?sa=L&ai=DChcSEwils83_1PrvAhUNjsgKHdWRC7sYABAEGgJxdQ&sig=AOD64_0WuY3UDlgTziPk9nUw0f8s3zW3nA&ctype=5&q=&ved=2ahUKEwinrcf_1PrvAhWFKs0KHZzNCsMQww96BAgCEFU&adurl=
部分 JSON 输出:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "graphic card buy",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for ads in results["shopping_results"]:
print(f"Ad link: {ads['link']}")
<块引用>
免责声明,我为 SerpApi 工作。