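I want to scrape the headlines and the paragraph text from the Google News search page for a given search term. I want to do this for the first n pages.
I have written code that scrapes only the first page, but I don't know how to modify the URL
so that it goes to the other pages (page 2, 3, ...). That is my first problem.
The second problem is that I don't know how to scrape the headlines. It always returns an empty list. I have tried several approaches, but every one of them returns an empty list. (I don't think the page is dynamic.)
On the other hand, scraping the paragraph text below each headline works fine. Can you tell me how to fix these two problems?
This is my code: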
from bs4 import BeautifulSoup
import requests
term = 'cocacola'
# this is only for page 1, how to go to page 2?
url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# I think this does not depend on JavaScript; the page is not dynamic
headline_results = soup.find_all('a', class_="l lLrAF")
# headline_results = soup.find_all('h3', class_="r dO0Ag")  # also does not work
print(headline_results)  # empty list, I don't know why
paragraph_results = soup.find_all('div', class_='st')
print(paragraph_results) # works
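Answer 0 (score: 1)
Problem one: paging.
To go to the next page, include the start parameter in the URL format string. It is the zero-based offset of the first result, so with 10 results per page, page 2 begins at start=10: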
term = 'cocacola'
page = 2
url = 'https://www.google.com/search?q={}&source=lnms&tbm=nws&start={}'.format(
    term, (page - 1) * 10
)
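To cover the first n pages rather than a single one, the same formula can be applied in a loop. The following is a minimal sketch, not a definitive implementation: the helper name `news_search_urls` and the constant `RESULTS_PER_PAGE` are hypothetical, and the page size of 10 is taken from the snippet above. It uses `urllib.parse.urlencode` so that search terms containing spaces are escaped correctly.

```python
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: 10 results per page, as in the answer above


def news_search_urls(term, n_pages):
    """Return the search URLs for pages 1..n_pages of a news query."""
    urls = []
    for page in range(1, n_pages + 1):
        query = urlencode({
            "q": term,
            "source": "lnms",
            "tbm": "nws",
            # zero-based offset of the first result on this page
            "start": (page - 1) * RESULTS_PER_PAGE,
        })
        urls.append("https://www.google.com/search?" + query)
    return urls


urls = news_search_urls("coca cola", 3)
for url in urls:
    print(url)
```

Each URL can then be passed to `requests.get` exactly as in the question.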
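Problem two: scraping the headlines.
Google regenerates the class names, IDs, and similar attributes of its DOM elements, so a selector tied to a specific class name (such as "l lLrAF") is very likely to break whenever you retrieve fresh, uncached results.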
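One way to depend less on generated class names is to select by document structure instead. The sketch below assumes, purely for illustration, that each headline is an `<h3>` nested inside an `<a href=...>`; the `SAMPLE_HTML` markup and the `HeadlineParser` class are hypothetical and do not reproduce Google's real page. It uses the standard-library `html.parser` so that it runs without extra packages.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect (href, text) for every <h3> nested inside an <a href=...>."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._href = None    # href of the <a> currently open, if any
        self._in_h3 = False  # True while inside a qualifying <h3>
        self._buf = []       # text fragments of the current headline

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "h3" and self._href is not None:
            self._in_h3 = True
            self._buf = []

    def handle_data(self, data):
        if self._in_h3:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self.headlines.append((self._href, "".join(self._buf).strip()))
            self._in_h3 = False
        elif tag == "a":
            self._href = None


# Hypothetical markup: class names are random on purpose and are never used
SAMPLE_HTML = """
<div class="x1y2"><a href="https://example.com/a"><h3 class="zz9">First headline</h3></a>
<div class="st">Snippet one</div></div>
<div class="q7"><a href="https://example.com/b"><h3 class="kk3">Second <b>headline</b></h3></a></div>
<h3>Not a link</h3>
"""

parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.headlines)
```

With BeautifulSoup, a roughly equivalent structural selection would be `soup.select("a h3")`. Whether that matches the page you actually receive has to be checked against the HTML in `response.text`.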
Answer 1 (score: 1)
Just add the parameter "start=10" to the search URL. For example:
https://www.google.com/search?q=beatifulsoup&ie=utf-8&oe=utf-8&aq=t&start=10
To do this dynamically, i.e. to loop over the result pages, use something like the following:
from bs4 import BeautifulSoup
from requests import get

term = "beautifulsoup"
page_max = 5

# loop over pages; start is the offset of the first result (10 per page)
for page in range(0, page_max):
    url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(term, 10 * page)
    r = get(url)  # you can also add headers here
    html_soup = BeautifulSoup(r.text, 'html.parser')
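The loop above overwrites `html_soup` on every iteration, so only the last page survives. A minimal sketch of one way to keep every page is shown below. The function `fetch_pages`, its `fetch` and `delay` parameters, and the `fake_fetch` stand-in are all hypothetical names introduced here; the HTTP call is passed in as a function so that the sketch runs offline.

```python
import time


def fetch_pages(term, page_max, fetch, delay=0.0):
    """Call fetch(url) for each result page and return the results in page order."""
    pages = []
    for page in range(page_max):
        url = "https://www.google.com/search?q={}&ie=utf-8&oe=utf-8&aq=t&start={}".format(
            term, 10 * page
        )
        pages.append(fetch(url))
        if delay and page < page_max - 1:
            time.sleep(delay)  # optional pause between consecutive requests
    return pages


# Stand-in for a real HTTP call; it only records which URLs were requested
requested = []


def fake_fetch(url):
    requested.append(url)
    return "<html>page for {}</html>".format(url)


pages = fetch_pages("beautifulsoup", 3, fake_fetch)
print(requested)
```

With requests, `fetch` could be something like `lambda url: get(url, headers=headers, timeout=10).text`, where `headers` is a dict you supply (for example with a browser-like User-Agent). Each returned string can then be handed to `BeautifulSoup(..., 'html.parser')`.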
Answer 2 (score: 0)
Link to my earlier answer to a partly similar question.
Alternatively, you can use the Google News Result API from SerpApi. It is a paid API with a free trial.
Partial JSON output:
"news_results": [
  {
    "position": 1,
    "link": "https://www.stltoday.com/lifestyles/food-and-cooking/best-bites-pepperidge-farms-caramel-macchiato-flavored-milano-cookies/article_d43e59a0-b362-5cb0-bdef-6b7563d9fed3.html",
    "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
    "source": "St. Louis Post-Dispatch",
    "date": "1 week ago",
    "snippet": "Coffee-flavored food items are usually very hit or miss. But we have found \nthe cookie that has accomplished the absolute best coffee flavoring I ...",
    "thumbnail": "https://serpapi.com/searches/608ffbbcef7ddabfb2982432/images/45d252f31c08b743573f629544c119f07e8c422143bff0265f31c8c08086393a.jpeg"
  }
]
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "best cookies",
    "tbm": "nws",       # news results
    "start": "10",      # offset of the first result, i.e. page 2
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\n")
Output:
Title: 10 Of The Absolute Best Cookies In Sydney
Title: This Cookie Quiz Will Reveal Your Best And Worst Quality
Title: Family cookies by Taimur Ali Khan is the best thing on internet
Title: Gibson Dunn Ranked Among Top Three Firms for Client ...
Title: Livingston CARES: Saying thank you to one cookie at a time
Title: Google's plan to replace cookies is the web's best hope for a more private internet
Title: The 12 Best Cookies in NYC
Title: 18 Places to Find the Best Cookies in the Champaign-Urbana ...
Title: Best Cookie Delivery Services - Where to Order Cookies Online
Title: How to make the best cookies for the holidays
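The loop above raises a `KeyError` if the response has no "news_results" key. A slightly more defensive way to pull the titles out of the result dict might look like the sketch below. The helper `news_titles` is a hypothetical name, and `sample` is a hand-written dict shaped like the partial JSON output above rather than a real API response.

```python
def news_titles(results):
    """Return the titles under "news_results", or [] if the key is absent."""
    return [
        item["title"]
        for item in results.get("news_results", [])
        if "title" in item
    ]


# Hand-written sample shaped like the partial JSON output shown above
sample = {
    "news_results": [
        {
            "position": 1,
            "title": "Best Bites: Pepperidge Farms Caramel Macchiato flavored Milano cookies",
            "source": "St. Louis Post-Dispatch",
        },
        {"position": 2, "source": "entry without a title"},
    ]
}

titles = news_titles(sample)
for title in titles:
    print(f"Title: {title}")

empty = news_titles({})  # a dict without "news_results" yields no titles
```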
Disclaimer: I work for SerpApi.