I have the following Python code. It scrapes a Google News search page and prints the hyperlink and title of each news story. My problem is that Google News groups similar stories into a bucket, and the script below only prints the first story from each bucket. How can I print all the stories from all the buckets?
from bs4 import BeautifulSoup
import requests
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#r = requests.get('http://www.aflcio.org/Legislation-and-Politics/Legislative-Alerts', headers=headers)
r = requests.get('https://www.google.com/search?q=%22eric+bledsoe%22&tbm=nws&tbs=qdr:d', headers=headers)
r = requests.get('https://www.google.com/search?q=%22lebron+james%22&tbm=nws&tbs=qdr:y', headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
letters = soup.find_all("div", class_="_cnc")
#print(soup.prettify())
#print(letters)
print(type(letters))
print(len(letters))
print("\n")
for x in range(0, len(letters)):
    print(x)
    print(letters[x].a["href"])
    print("\n")
letters2 = soup.find_all("a", class_="l _HId")
for x in range(0, len(letters2)):
    print(x)
    print(letters2[x].get_text())
print("\n----------content")
#print(letters[0])
By grouped news I mean that, in the image below, the first few stories are grouped together. The story "LeBron James compares one of his teammates to Denn" is part of another group.
Answer 0 (score: 1)
I'm not sure exactly what you mean. If you mean that you are trying to parse multiple pages, then I can tell you that you are overwriting r by calling requests.get() multiple times. Here is a loop that handles all of the URLs in a urls list.
import bs4
import requests
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
urls = ["https://www.google.com/search?q=%22eric+bledsoe%22&tbm=nws&tbs=qdr:d",
        "https://www.google.com/search?q=%22lebron+james%22&tbm=nws&tbs=qdr:y"]
ahrefs = []
titles = []
for url in urls:
    req = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(req.text, "html.parser")
    # you don't even have to process the div container;
    # just go straight to the <a> tags and index into "href"
    # headlines
    ahref = [a["href"] for a in soup.find_all("a", class_="_HId")]
    # "buckets"
    ahref += [a["href"] for a in soup.find_all("a", class_="_sQb")]
    ahrefs.append(ahref)
    # or use get_text() to return the text inside the hyperlink,
    # which is the title you want
    title = [a.get_text() for a in soup.find_all("a", class_="_HId")]
    title += [a.get_text() for a in soup.find_all("a", class_="_sQb")]
    titles.append(title)
#print(ahrefs)
#print(titles)
My Google search for lebron shows 18 results, including the sub-headlines, and len(ahrefs[1]) == 18.
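If it helps to see the output side by side, here is a minimal sketch (reusing the urls, ahrefs, and titles variables from the loop above; the print format itself is just an illustration) that pairs each title with its link:

# sanity-check: print each query's result count, then title/link pairs
for url, links, heads in zip(urls, ahrefs, titles):
    print(url, "returned", len(links), "results")
    for head, link in zip(heads, links):
        print(head, "->", link)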
Answer 1 (score: 1)
Taking a completely fresh approach, I decided on a more efficient way to tackle this, so that you only have to append a query to search for a new player. I'm not sure what final result you want, but this returns a list of dictionaries.
import bs4
import requests
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#just add to this list for each new player
#player name : url
queries = {"bledsoe": "https://www.google.com/search?q=%22eric+bledsoe%22&tbm=nws&tbs=qdr:d",
           "james": "https://www.google.com/search?q=%22lebron+james%22&tbm=nws&tbs=qdr:y"}
total = []
for player in queries:  # keys
    # request the google query url for each player
    req = requests.get(queries[player], headers=headers)
    soup = bs4.BeautifulSoup(req.text, "html.parser")
    # look for the main container
    for each in soup.find_all("div"):
        results = {player: {"link": None,
                            "title": None,
                            "source": None,
                            "time": None}}
        try:
            # if a <div> has no class attribute, each.attrs["class"]
            # throws a KeyError; just ignore those divs
            if "_cnc" in each.attrs["class"]:  # main stories
                results[player]["link"] = each.find("a")["href"]
                results[player]["title"] = each.find("a").get_text()
                sourceAndTime = each.contents[1].get_text().split("-")
                results[player]["source"], results[player]["time"] = sourceAndTime
                total.append(results)
            elif "card-section" in each.attrs["class"]:  # buckets
                results[player]["link"] = each.find("a")["href"]
                results[player]["title"] = each.find("a").get_text()
                results[player]["source"] = each.contents[1].contents[0].get_text()
                results[player]["time"] = each.contents[1].get_text()
                total.append(results)
        except KeyError:
            pass
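In case a usage example helps, here is a minimal sketch that walks the total list built above and prints each collected story; the field names match the results dict defined in the loop, and the separator format is just an assumption about how you might want it displayed:

# each entry in total is {player: {"link": ..., "title": ..., "source": ..., "time": ...}}
for entry in total:
    for player, story in entry.items():
        print(player, "|", story["title"], "|", story["link"])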