使用Python和Beautiful Soup抓取Google新闻结果只会检索没有标题的第一页

时间:2019-11-17 11:25:27

标签: python web-scraping beautifulsoup

我想根据搜索到的术语从Google新闻搜索页面中抓取标题和段落文本。 我想在前 n 页中这样做。

我已经编写了一段仅用于抓取第一页的代码,但是我不知道如何修改url以便可以转到其他页面(第2、3 ...页)。那是我遇到的第一个问题

第二个问题是我不知道该如何抓标题。它总是返回我空白列表。我尝试了多种解决方案,但始终会返回空白列表。 (我认为该页面不是动态的。)

另一方面,在标题下方抓取段落文本效果很好。 你能告诉我如何解决这两个问题吗?

这是我的代码:

from bs4 import BeautifulSoup
import requests

term = 'cocacola'

# this is only for page 1, how to go to page 2?
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# I think that this is not javascipt sensitive, its not dynamic            
headline_results = soup.find_all('a', class_="l lLrAF")
#headline_results = soup.find_all('h3', class_="r dO0Ag") # also does not work
print(headline_results) #empty list, IDK why?

paragraph_results = soup.find_all('div', class_='st')
print(paragraph_results) # works

3 个答案:

答案 0 :(得分:1)

问题一:翻转页面。

要转到下一页,您需要在URL格式的字符串中包含start关键字:

term = 'cocacola'
page = 2
url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
    term, (page - 1) * 10
)

问题二:取消标题。

Google重新生成DOM元素的类,ID等的名称,因此,每次检索一些新的未缓存信息时,您的方法很可能会失败。

答案 1 :(得分:1)

只需在搜索词中添加参数“ start = 10”。喜欢: https://www.google.com/search?q=beatifulsoup&ie=utf-8&oe=utf-8&aq=t&start=10

对于动态行为/响应页面上的循环,请使用以下内容:

from bs4 import BeautifulSoup
from request import get

term="beautifulsoup"
page_max = 5

# loop over pages
for page in range(0, page_max):
    url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(term, 10*page)

    r = get(url) # you can also add headers here
    html_soup = BeautifulSoup(r.text, 'html.parser')

答案 2 :(得分:0)

Link 到我之前回答的部分相同的问题。


或者,您可以使用来自 SerpApi 的 Google News Result API。这是一个免费试用的付费 API。

部分 JSON 输出:

"news_results": [
  {
    "position": 1,
    "link": "https://www.stltoday.com/lifestyles/food-and-cooking/best-bites-pepperidge-farms-caramel-macchiato-flavored-milano-cookies/article_d43e59a0-b362-5cb0-bdef-6b7563d9fed3.html",
    "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
    "source": "St. Louis Post-Dispatch",
    "date": "1 week ago",
    "snippet": "Coffee-flavored food items are usually very hit or miss. But we have found \nthe cookie that has accomplished the absolute best coffee flavoring I ...",
    "thumbnail": "https://serpapi.com/searches/608ffbbcef7ddabfb2982432/images/45d252f31c08b743573f629544c119f07e8c422143bff0265f31c8c08086393a.jpeg"
  }
]

要集成的代码:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "best cookies",
  "tbm": "nws",
  "start": "10",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
  print(f"Title: {news_result['title']}\n")

输出:

Title: 10 Of The Absolute Best Cookies In Sydney
    
Title: This Cookie Quiz Will Reveal Your Best And Worst Quality

Title: Family cookies by Taimur Ali Khan is the best thing on internet

Title: Gibson Dunn Ranked Among Top Three Firms for Client ...

Title: Livingston CARES: Saying thank you to one cookie at a time

Title: Google's plan to replace cookies is the web's best hope for a more private internet

Title: The 12 Best Cookies in NYC

Title: 18 Places to Find the Best Cookies in the Champaign-Urbana ...

Title: Best Cookie Delivery Services - Where to Order Cookies Online

Title: How to make the best cookies for the holidays
<块引用>

免责声明,我为 SerpApi 工作。