Question

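I am trying to scrape Google News headings and their links for an input term. However, when I use the find_all method to search for the class that contains all the news headings, it returns an empty list.

I also tried the parent div by its ID, but the result was no different.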

import requests
from bs4 import BeautifulSoup

input_term = input("Enter a term to search:")
source = requests.get("https://www.google.com/search?q={0}&source=lnms&tbm=nws".format(input_term)).text
soup = BeautifulSoup(source, 'html.parser')

# 'bkWMgd' is the class that I found to contain all the search results.
heading_results = soup.find_all('div', class_ = 'bkWMgd')
print(heading_results)

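I want to scrape all the news headings along with their respective links. I expected the code above to list all the search results, but it returns an empty list.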

Answer 1

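The HTML that requests receives (and that BeautifulSoup parses) can be completely different from what you see in the browser, because the browser runs JavaScript and requests does not. As a result, the selectors you need may differ. It is always a good idea to print the response you actually received, analyze its HTML, and then choose your selectors based on the classes and ids that are really there.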

import requests
from bs4 import BeautifulSoup

input_term = input("Enter a term to search:")
source = requests.get(
    "https://www.google.com/search?q={0}&source=lnms&tbm=nws".format(input_term)).text
soup = BeautifulSoup(source, 'html.parser')

# here div#ires contains an ol which contains the results.
heading_results = soup.find("div", {"id": "ires"}).find("ol").find_all('h3', {'class': 'r'})
# Loop over each item to obtain the title and link (anchor tag text and link)
for heading in heading_results:
    anchor = heading.find('a')
    if anchor is not None:
        print(anchor.text, anchor.get('href'))
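
As a rough sketch of the "print the response and analyze the HTML" advice, the snippet below counts which classes actually occur in the markup that was received. The helper name count_classes and the sample_html string are illustrative assumptions (a stand-in for requests.get(...).text), not Google's real markup.

```python
from collections import Counter

from bs4 import BeautifulSoup


def count_classes(html, tag_name="div"):
    """Count how often each CSS class appears on the given tag (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for tag in soup.find_all(tag_name):
        # tag.get("class") is a list of class names, or None when the attribute is absent.
        counts.update(tag.get("class") or [])
    return counts


# Assumed stand-in for the text returned by requests.get(...).text
sample_html = """
<div id="ires">
  <div class="g"><h3 class="r"><a href="/url?q=https://example.com/a">First</a></h3></div>
  <div class="g"><h3 class="r"><a href="/url?q=https://example.com/b">Second</a></h3></div>
</div>
"""

classes = count_classes(sample_html)
print(classes.most_common())   # the classes that are really present
print("bkWMgd" in classes)     # whether the class seen in the browser exists here
```

If the class copied from the browser's inspector does not show up in this count, find_all on that class will return an empty list.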

Answer 2

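Here is the code, which I tested on multiple search results. To make it work for a different search, just change the URL passed to requests.get (the call assigned to the response variable).

A shorter URL also works (for example: https://www.google.com/search?hl=en-US&q=best+cookies&tbm=nws).

Code and full example: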

from bs4 import BeautifulSoup
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get(
    'https://www.google.com/search?hl=en-US&q=best+coockie&tbm=nws&sxsrf=ALeKk009n7GZbzUhUpsMTt89rigSAluBsQ%3A1616683043826&ei=I6BcYP_OMeGlrgTAwLpA&oq=best+coockie&gs_l=psy-ab.3...325216.326993.0.327292.12.12.0.0.0.0.163.1250.2j9.11.0....0...1c.1.64.psy-ab..1.0.0....0.305S8ngx0uo',
    headers=headers)

html = response.text
soup = BeautifulSoup(html, 'lxml')

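# Each 'dbsr' div wraps one news result.
for heading in soup.find_all('div', class_='dbsr'):
    # The headline text is in a nested div; the link is the first anchor.
    title = heading.find('div', class_='JheGif nDgy9d').text
    link = heading.a['href']
    print(title)
    print(link)
    print()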

Output:

The BEST cookie on the planet (and the Village too!)
https://thecoastnews.com/the-best-cookie-on-the-planet-and-the-village-too/

Best baking kits for kids 2021: Cookie mixes to flapjack recipes
https://www.independent.co.uk/extras/indybest/food-drink/baking/best-kids-baking-kits-b1821245.html

The official Girl Scout cookie power rankings
https://www.latimes.com/food/story/2021-02-24/girl-scout-cookie-power-rankings

Girl Scout Cookie Taste Test: Little Brownie Bakers vs. ABC
https://www.thedailymeal.com/eat/girl-scout-cookie-taste-comparison-abc-little-brownie-bakers

Food Critic, Provocateur Definitively Ranks Girl Scout Cookies
https://www.npr.org/2021/03/07/974226510/food-critic-provocateur-definitively-ranks-girl-scout-cookies

Chef Magnus Nilsson Jam Shortbread Cookie Recipe From ...
https://www.bloomberg.com/news/articles/2021-02-26/chef-magnus-nilsson-jam-shortbread-cookie-recipe-from-faviken-breakfast

Top 10 Best Cookie Cutters 2021 – Bestgamingpro
https://bestgamingpro.com/cookie-cutters/

Learn to make a favorite Girl Scout cookie at home
https://www.latimes.com/food/story/2021-02-25/learn-to-make-the-best-girl-scout-cookie-at-home

The 5 Best Cookie Jars
https://www.elitedaily.com/p/the-5-best-cookie-jars-63505798

Ulker Biskuvi Turkey's Best Cookie Picked as Top Stock for 2021
https://www.bloomberg.com/news/articles/2021-02-25/cookie-maker-tops-turkey-s-best-stock-bets-amid-hunt-for-value

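Also, to get the .text and the URLs, you need to specify which element (a div or something else) to take them from.

In your code you only specified a div and a class. On top of that, if you tried to get .text from the result, it would raise an error: AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In that case, you can use a for loop and extract what you need inside it.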
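
A minimal sketch of that error and the loop-based fix, run against a small hand-written HTML string rather than a live Google response. The 'title' class and the example.com links are made up for illustration; only the 'dbsr' class comes from the code above.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; real Google markup differs and changes over time.
sample_html = """
<div class="dbsr"><a href="https://example.com/one"><div class="title">One</div></a></div>
<div class="dbsr"><a href="https://example.com/two"><div class="title">Two</div></a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
results = soup.find_all("div", class_="dbsr")  # a ResultSet (list-like), not a single tag

try:
    results.text  # treating the whole list like one element
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)

# Iterate instead, and read .text / href from each individual element.
pairs = [(r.find("div", class_="title").text, r.a["href"]) for r in results]
for title, link in pairs:
    print(title, link)
```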

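Sometimes find_all()/findAll() returns an empty list because you did not specify a user-agent. The default user-agent that requests sends is different from a browser's (Google may treat it as, say, a tablet), so the returned HTML has different classes and selectors. So when you search with class_='bkWMgd', that class is not actually in the response, because the page was served for a different user-agent. Hope that makes sense.

I skipped the input element because it complicates things :)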
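
To see which user-agent would be sent, one option is to prepare the request without sending it. This is only a sketch: it makes no network call and says nothing about what Google returns, it just compares the header requests fills in by default with a browser-like one (the string is the one used in the code above).

```python
import requests

url = "https://www.google.com/search?hl=en-US&q=best+cookies&tbm=nws"

browser_headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

session = requests.Session()

# prepare_request merges the session's default headers into the request
# without sending anything over the network.
default_request = session.prepare_request(requests.Request("GET", url))
custom_request = session.prepare_request(
    requests.Request("GET", url, headers=browser_headers))

print(default_request.headers["User-Agent"])  # the python-requests default
print(custom_request.headers["User-Agent"])   # the browser-like string
```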
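
If the input term is wanted back, one possible approach is to build the shorter URL form from the term with the standard library. The function name build_news_url is hypothetical; the hl, q and tbm parameters are the ones from the short URL mentioned earlier.

```python
from urllib.parse import urlencode


def build_news_url(term):
    """Build the short Google News search URL for a term (hypothetical helper)."""
    params = {"hl": "en-US", "q": term, "tbm": "nws"}
    # urlencode escapes spaces and special characters in the term.
    return "https://www.google.com/search?" + urlencode(params)


print(build_news_url("best cookies"))
# https://www.google.com/search?hl=en-US&q=best+cookies&tbm=nws
```

The returned string can then be passed to requests.get together with the headers, with the term coming from input() as in the question.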

Alternatively, you can get these results (and more) with the SerpApi News Results API.

Example of SerpApi JSON news results:

"news_results": [
    {
      "position": 1,
      "title": "Trump brushes aside environmental concerns, signs 2 executive ...",
      "link": "https://www.usatoday.com/story/news/nation/2019/04/10/president-trump-orders-speed-oil-gas-pipeline-projects/3431466002/",
      "source": "USA TODAY",
      "date": "6 hours ago",
      "snippet": "Aiming to streamline oil and gas pipeline projects, President Donald Trump on Wednesday signed two executive orders making it harder for ...",
      "category": "In-Depth",
      "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRQdBI3wIjf_BX3zfRRYJjTGRRF5CNNZvqWAuza8-4mVZ75iBjlwOVTxcfGtg6_hLyUbPQ9cFA"
    }
]

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "best cookies",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# Print the title and link of each news result.
for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}")

Disclaimer: I work for SerpApi.

Unable to scrape Google News headings via their class
