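I'm trying to download links from a Google search (in Python), and I'm using Beautiful Soup to do it. http://www.google.ca/search?q=QUERY_HERE
is the URL I send the request to. I'd like to get more links from page 2 and page 3 as well.
How can I do this, and how can I restrict the search to Google News only?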
Answer 0 (score: 0)
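First, find out your results-per-page setting using the search settings option at the bottom right of the google.com page, or check whether the link below still works:
https://www.google.co.in/preferences?hl=en
Then you can specify a start value in the query:
https://www.google.co.in/search?q=hello&hl=en&start=70
So start=0 is the first page, and from there you just change the start value according to the number of results per page.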
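As a minimal sketch of the idea above: assuming `start` is a zero-based result offset and that each page holds 10 results (an assumption; use whatever your settings say), the URL for a given page could be built like this. The helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(query, page, results_per_page=10,
                     base="https://www.google.co.in/search"):
    """Return the search URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    # Page 1 -> start=0, page 2 -> start=results_per_page, and so on.
    params = {"q": query, "hl": "en", "start": (page - 1) * results_per_page}
    return f"{base}?{urlencode(params)}"


for page in (1, 2, 3):
    print(build_search_url("hello", page))
```

With 10 results per page, page 8 corresponds to start=70, which matches the example URL above.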
Answer 1 (score: 0)
To search using only Google News, add tbm=nws to your URL:
https://www.google.com/search?q=coca+cola
--> https://www.google.com/search?q=coca+cola&tbm=nws
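If you would rather not edit the URL by hand, here is a small sketch that adds the parameter with the standard library. The helper name `with_query_param` is hypothetical and not part of any library.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_query_param(url, key, value):
    """Return url with key=value set in its query string (hypothetical helper)."""
    parts = urlparse(url)
    # Note: dict() keeps only the last value if a key is repeated in the query.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query[key] = value
    return urlunparse(parts._replace(query=urlencode(query)))


print(with_query_param("https://www.google.com/search?q=coca+cola", "tbm", "nws"))
```

Parsing and re-encoding the query string avoids mistakes such as appending `&tbm=nws` to a URL that has no `?` yet.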
Here is how to scrape the actual pagination using the beautifulsoup, requests, and urllib libraries.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, urllib.parse


def paginate(url, previous_url=None):
    # Break from infinite recursion
    if url == previous_url: return

    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
    }

    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    # First page
    yield soup

    next_page_node = soup.select_one('a#pnnext')

    # Stop when there is no next page
    if next_page_node is None: return

    next_page_url = urllib.parse.urljoin('https://www.google.com/',
                                         next_page_node['href'])

    # Pages after the first one
    yield from paginate(next_page_url, url)


def scrape():
    pages = paginate(
        "https://www.google.com/search?hl=en-US&q=coca+cola&tbm=nws")

    for soup in pages:
        print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
        print()

        for data in soup.find_all('div', class_='dbsr'):
            title = data.find('div', class_='JheGif nDgy9d').text
            link = data.a['href']

            print(f'Title: {title}')
            print(f'Link: {link}')
            print()
# part of the output:
'''
Results via beautifulsoup
Current page: 1
Title: A Post-Truth World: Why Ronaldo Did Not Move Coca-Cola Share Price
Link: https://www.forbes.com/sites/iese/2021/06/19/a-post-truth-world-why-ronaldo-did-not-move-coca-cola-share-price/
...
Current page: 22
Title: The Coca-Cola Co. unveils oat milk line
Link: https://www.foodbusinessnews.net/articles/18356-the-coca-cola-co-unveils-oat-milk-line
'''
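The `urljoin` call in `paginate` is there because the next-page link may be a relative path. A short sketch of what that step does, using a made-up `href` value for illustration only (the wrapper name `resolve_next_url` is hypothetical):

```python
from urllib.parse import urljoin


def resolve_next_url(href, base="https://www.google.com/"):
    """Turn a possibly relative href into an absolute URL (hypothetical helper)."""
    return urljoin(base, href)


# Made-up href, standing in for the value of next_page_node['href'].
print(resolve_next_url("/search?q=coca+cola&tbm=nws&start=10"))

# An href that is already absolute is returned unchanged.
print(resolve_next_url("https://www.google.com/search?q=coca+cola&start=20"))
```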
Alternatively, you can use the Google Search Engine Results API from SerpApi to do the same thing. It is a paid API with a free trial of 5,000 searches. Check out the playground.
Code to integrate:
# https://github.com/serpapi/google-search-results-python
from serpapi import GoogleSearch
import os


def scrape():
    params = {
        "engine": "google",
        "q": "coca cola",
        "tbm": "nws",
        "api_key": os.getenv("API_KEY"),
    }

    search = GoogleSearch(params)
    pages = search.pagination()

    for result in pages:
        print(f"Current page: {result['serpapi_pagination']['current']}")

        for news_result in result["news_results"]:
            print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
# part of the output:
'''
Results from SerpApi
Current page: 1
Title: A Post-Truth World: Why Ronaldo Did Not Move Coca-Cola Share Price
Link: https://www.forbes.com/sites/iese/2021/06/19/a-post-truth-world-why-ronaldo-did-not-move-coca-cola-share-price/
...
Current page: 5
Title: Coca-Cola, Monster win appeal of $9.6 million verdict over ...
Link: https://www.reuters.com/legal/transactional/coca-cola-monster-win-appeal-96-million-verdict-over-hansens-rights-2021-06-18/
'''
Disclaimer: I work for SerpApi.