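I have never had the chance to learn web scraping. What can I add to the code below to get the top headlines for a given time period? It would be even better if I could get only financial news!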
import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
news_url="https://news.google.com/news/rss"
Client=urlopen(news_url)
xml_page=Client.read()
Client.close()
soup_page=soup(xml_page,"xml")
news_list=soup_page.findAll("item")
# Print news title, url and publish date
for news in news_list:
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print("-"*60)
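Answer 0 (score: 0)
Here is my solution. Check the get_headlines(start_date, end_date) function I included at the end.
I convert the pubDate strings from the XML you scraped into datetime objects and compare them with the datetime objects I specify, which produces booleans. From those booleans we can tell whether an article falls within our range, and then select only those articles.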
import bs4
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
from datetime import datetime
news_url="https://news.google.com/news/rss"
Client=urlopen(news_url)
xml_page=Client.read()
Client.close()
soup_page=soup(xml_page,"xml")
news_list=soup_page.findAll("item")
# Print news title, url and publish date
for news in news_list:
    print(news.title.text)
    print(news.link.text)
    print(news.pubDate.text)
    print("-"*60)
    # pubDate uses abbreviated month names ("May", "Jun"), so parse with %b rather than %B
    print("Date Object: ", datetime.strptime(news.pubDate.text, "%a, %d %b %Y %H:%M:%S %Z"))
    sample_end_date = "Wed, 08 May 2019 18:17:04 GMT"
    print(datetime.strptime(sample_end_date, "%a, %d %b %Y %H:%M:%S %Z") > datetime.strptime(news.pubDate.text, "%a, %d %b %Y %H:%M:%S %Z"))
    # True when the datetime of the article is less than the datetime of the end date
    sample_start_date = "Wed, 08 May 2019 00:00:00 GMT"
    print(datetime.strptime(sample_start_date, "%a, %d %b %Y %H:%M:%S %Z") < datetime.strptime(news.pubDate.text, "%a, %d %b %Y %H:%M:%S %Z"))
    # True when the datetime of the article is greater than the datetime of the start date
# If both values are True, then we know that the article falls within the range we specified. If not, then it falls outside the range.
def get_headlines(start_date=None, end_date=None):
    # Prompt at call time; input() used as a default argument runs once, when the function is defined
    if start_date is None:
        start_date = input("Enter start date. \nFollow this format exactly for date input Wed, 08 May 2019 18:17:04 GMT: \n")
    if end_date is None:
        end_date = input("Enter end date. \n")
    # %b matches the abbreviated month names used in RSS pubDate values
    date_format = "%a, %d %b %Y %H:%M:%S %Z"
    start_date_object = datetime.strptime(start_date, date_format)
    end_date_object = datetime.strptime(end_date, date_format)
    news_url="https://news.google.com/news/rss"
    Client=urlopen(news_url)
    xml_page=Client.read()
    Client.close()
    soup_page=soup(xml_page,"xml")
    news_list=soup_page.findAll("item")
    # Print news title, url and publish date for articles inside the range
    print(f"All articles from {start_date_object} to {end_date_object}: \n")
    for news in news_list:
        if end_date_object > datetime.strptime(news.pubDate.text, date_format) > start_date_object:
            print(news.title.text)
            print(news.link.text)
            print(news.pubDate.text)
            print("-"*60)

get_headlines()
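Here is sample output from midnight GMT on Wednesday to 18:00 GMT on Wednesday:
Enter start date.
Follow this format exactly for date input Wed, 08 May 2019 18:17:04 GMT:
Wed, 08 May 2019 00:00:00 GMT
Enter end date.
Wed, 08 May 2019 18:17:04 GMT
All articles from 2019-05-08 00:00:00 to 2019-05-08 18:17:04: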
Trump briefed on Colorado shooting, White House says - Fox News https://www.foxnews.com/us/trump-briefed-on-colorado-shooting-white-house-says-politicians-offer-condolences
Iran's leader announces partial withdrawal from nuclear deal - CNN https://www.cnn.com/2019/05/08/middleeast/iran-nuclear-deal-intl/index.html
China backtracked on nearly all aspects of US trade deal: sources - CNBC https://www.cnbc.com/2019/05/08/china-backtracked-on-nearly-all-aspects-of-us-trade-deal-sources.html
Barr's top aide saw little of the Russia probe in Trump's world - POLITICO https://www.politico.com/story/2019/05/08/brian-rabbitt-william-barr-1309751
WH instructs former counsel not to comply with congressional subpoena - ABC News https://abcnews.go.com/Politics/white-house-instruct-counsel-comply-congressional-subpoena/story?id=62873987
Trump holds rally in Florida Panhandle as disaster relief funding stalls - NPR https://www.npr.org/2019/05/08/720803270/as-hurricane-relief-stalls-in-d-c-trump-to-rally-base-in-florida-panhandle
Colorado STEM school student Brendan Bialy helped disarm gunman - NBC News https://www.nbcnews.com/news/us-news/colorado-stem-school-student-brendan-bialy-helped-disarm-gunman-n1003181
Pompeo makes surprise Iraq visit amid rising Iran tensions - Aljazeera.com https://www.aljazeera.com/news/2019/05/pompeo-surprise-iraq-visit-rising-iran-tensions-190508034718722.html
In South Africa's election, Ramaphosa faces the verdict of disillusioned voters - The New York Times https://www.nytimes.com/2019/05/08/world/africa/south-africa-election.html
Lahore blast: At least 6 killed in explosion near Sufi shrine - CNN https://www.cnn.com/2019/05/08/asia/lahore-blast-intl/index.html
Billionaire Charlie Munger compares Bitcoin investors to "Judas Iscariot" - Ethereum World News https://ethereumworldnews.com/billionaire-charlie-munger-compares-bitcoin-investors-to-judas-iscariot/
Android TV benefits after Assistant is linked to live TV guide data - Engadget https://www.engadget.com/2019/05/08/google-assistant-epg-android-tv-play-store/
Kim Kardashian prison reform: Kim Kardashian West has helped free 17 people from prison in the last 90 days - CBS News https://www.cbsnews.com/news/kim-kardashian-west-has-helped-free-17-people-from-prison-in-the-last-90-days/
Vampire Weekend perform with Haim on "Fallon": Watch - Pitchfork https://pitchfork.com/news/vampire-weekend-perform-with-haim-on-fallon-watch/
George Clooney reveals Harry and Meghan's royal baby shares his birthday - Daily Mail https://www.dailymail.co.uk/tvshowbiz/article-7004777/George-Clooney-reveals-Prince-Harry-Meghan-Markles-newborn-shares-birthday.html
Oakland A's pitcher Mike Fiers throws second career no-hitter, beats Reds - Fox News https://www.foxnews.com/sports/athletics-fiers-pitching-no-hitter-beats-reds
Joe Namath hasn't had a drink since his embarrassing moment on live TV - NBC Sports http://profootballtalk.nbcsports.com/2019/05/07/joe-namath-hasnt-had-a-drink-since-his-embarrassing-moment-on-live-tv/
Mariners fall to .500 with another bullpen collapse that leads to 5-4 loss in Bronx - The Seattle Times https://www.seattletimes.com/sports/mariners/mariners-fall-to-500-with-another-bullpen-collapse-that-leads-to-5-4-loss-in-bronx/
Mobile silicon switches: there is a new way to compute - Phys.org https://phys.org/news/2019-05-silicon.html
NASA asteroid: Space agency lays out bold asteroid defense plan - "ideal target" - Express.co.uk https://www.express.co.uk/news/science/1123704/NASA-asteroid-double-redirection-test-NASA-DART-asteroid-Didymos
RFK Jr. is our brother and uncle. He is tragically wrong about vaccines. - POLITICO https://www.politico.com/magazine/story/2019/05/08/robert-kennedy-jr-measles-vaccines-226798
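If you would rather not hard-code a strptime format, the same range check can be sketched with the standard library's RFC 822 date parser. This is only a minimal sketch: filter_by_date and the sample_items list are hypothetical names and made-up data standing in for the scraped items, not part of the code above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def filter_by_date(items, start, end):
    """Return the (title, link, pub_date) tuples published within [start, end]."""
    selected = []
    for title, link, pub_date in items:
        # Parse an RFC 822 date such as "Wed, 08 May 2019 18:17:04 GMT"
        # into a timezone-aware datetime.
        published = parsedate_to_datetime(pub_date)
        if start <= published <= end:
            selected.append((title, link, pub_date))
    return selected


# Hypothetical sample data standing in for the scraped <item> elements.
sample_items = [
    ("Headline A", "https://example.com/a", "Tue, 07 May 2019 23:59:59 GMT"),
    ("Headline B", "https://example.com/b", "Wed, 08 May 2019 00:56:15 GMT"),
    ("Headline C", "https://example.com/c", "Wed, 08 May 2019 19:00:00 GMT"),
]

# The bounds must be timezone-aware so they compare cleanly with the parsed dates.
start = datetime(2019, 5, 8, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2019, 5, 8, 18, 17, 4, tzinfo=timezone.utc)

in_range = filter_by_date(sample_items, start, end)
for title, link, pub_date in in_range:
    print(title, link, pub_date)
```

With the real feed, the items list could be built as [(n.title.text, n.link.text, n.pubDate.text) for n in news_list]. Keeping the filter separate from the download also makes it easy to try out without hitting the network.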
Answer 1 (score: 0)
Try feedparser:
import feedparser
news_url=r'https://news.google.com/news/rss'
fp = feedparser.parse(news_url)
## number of entries
len(fp['entries'])
Output:
38
Title of the article at index 0:
print(fp['entries'][0]['title'])
Output:
School Shooting in Colorado Leaves 1 Student Dead and 7 Injured - The New York Times
Print all the information for the entry at index 0:
fp['entries'][0]
Output:
{'title': 'School Shooting in Colorado Leaves 1 Student Dead and 7 Injured - The New York Times',
'title_detail': {'type': 'text/plain',
'language': None,
'base': 'https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en',
'value': 'School Shooting in Colorado Leaves 1 Student Dead and 7 Injured - The New York Times'},
'links': [{'rel': 'alternate',
'type': 'text/html',
'href': 'https://www.nytimes.com/2019/05/07/us/colorado-school-shooting.html'}],
'link': 'https://www.nytimes.com/2019/05/07/us/colorado-school-shooting.html',
'id': '52780288859641',
'guidislink': False,
'published': 'Wed, 08 May 2019 00:56:15 GMT',
'published_parsed': time.struct_time(tm_year=2019, tm_mon=5, tm_mday=8, tm_hour=0, tm_min=56, tm_sec=15, tm_wday=2, tm_yday=128, tm_isdst=0),
'summary': '<ol><li><a href="https://www.nytimes.com/2019/05/07/us/colorado-school-shooting.html" target="_blank">School Shooting in Colorado Leaves 1 Student Dead and 7 Injured</a> <font color="#6f6f6f">The New York Times</font></li><li><a href="https://www.foxnews.com/us/injuries-reported-unstable-situation-shots-fired-at-colorado-school-sheriff-says" target="_blank">Colorado school shooting leaves at least 1 dead, 7 injured, 2 in custody, sheriff\'s office says</a> <font color="#6f6f6f">Fox News</font></li><li><a href="https://www.cnn.com/2019/05/07/us/colorado-denver-area-school-shooting/index.html" target="_blank">Eight injured in school shooting in suburban Denver, 2 suspects are in custody</a> <font color="#6f6f6f">CNN</font></li><li><a href="https://kdvr.com/2019/05/07/president-trump-briefed-on-highlands-ranch-school-shooting/" target="_blank">President Trump briefed on Highlands Ranch school shooting</a> <font color="#6f6f6f">FOX 31 Denver</font></li><li><a href="https://www.oregonlive.com/nation/2019/05/sheriff-school-shooting-near-denver-injures-at-least-7.html" target="_blank">Sheriff: School shooting near Denver injures at least 7</a> <font color="#6f6f6f">OregonLive</font></li><li><strong><a href="https://news.google.com/stories/CAAqcQgKImtDQklTU2pvSmMzUnZjbmt0TXpZd1NqMEtFUWo1dV9ueWpZQU1FVWE5TGp2Z2NDNFJFaWhUYUc5MGN5Qm1hWEpsWkNCaGRDQnpZMmh2YjJ3Z2FXNGdTR2xuYUd4aGJtUnpJRkpoYm1Ob0tBQVAB?oc=5" target="_blank">View full coverage on Google News</a></strong></li></ol>',
'summary_detail': {'type': 'text/html',
'language': None,
'base': 'https://news.google.com/rss?hl=en-US&gl=US&ceid=US:en',
'value': '<ol><li><a href="https://www.nytimes.com/2019/05/07/us/colorado-school-shooting.html" target="_blank">School Shooting in Colorado Leaves 1 Student Dead and 7 Injured</a> <font color="#6f6f6f">The New York Times</font></li><li><a href="https://www.foxnews.com/us/injuries-reported-unstable-situation-shots-fired-at-colorado-school-sheriff-says" target="_blank">Colorado school shooting leaves at least 1 dead, 7 injured, 2 in custody, sheriff\'s office says</a> <font color="#6f6f6f">Fox News</font></li><li><a href="https://www.cnn.com/2019/05/07/us/colorado-denver-area-school-shooting/index.html" target="_blank">Eight injured in school shooting in suburban Denver, 2 suspects are in custody</a> <font color="#6f6f6f">CNN</font></li><li><a href="https://kdvr.com/2019/05/07/president-trump-briefed-on-highlands-ranch-school-shooting/" target="_blank">President Trump briefed on Highlands Ranch school shooting</a> <font color="#6f6f6f">FOX 31 Denver</font></li><li><a href="https://www.oregonlive.com/nation/2019/05/sheriff-school-shooting-near-denver-injures-at-least-7.html" target="_blank">Sheriff: School shooting near Denver injures at least 7</a> <font color="#6f6f6f">OregonLive</font></li><li><strong><a href="https://news.google.com/stories/CAAqcQgKImtDQklTU2pvSmMzUnZjbmt0TXpZd1NqMEtFUWo1dV9ueWpZQU1FVWE5TGp2Z2NDNFJFaWhUYUc5MGN5Qm1hWEpsWkNCaGRDQnpZMmh2YjJ3Z2FXNGdTR2xuYUd4aGJtUnpJRkpoYm1Ob0tBQVAB?oc=5" target="_blank">View full coverage on Google News</a></strong></li></ol>'},
'source': {'href': 'https://www.nytimes.com', 'title': 'The New York Times'}}
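To get back to the original question, the 'published_parsed' field shown above can be used to keep only the entries inside a time window. The sketch below is one possible way to do that, not something feedparser provides: entry_datetime, select_entries, and the sample entries are hypothetical. The keyword match on titles is a crude stand-in for "financial news only" and will miss or over-match some articles.

```python
import time
from datetime import datetime, timezone


def entry_datetime(entry):
    """Convert an entry's 'published_parsed' struct_time (UTC) to an aware datetime."""
    return datetime(*entry["published_parsed"][:6], tzinfo=timezone.utc)


def select_entries(entries, start, end, keywords=()):
    """Keep entries published within [start, end]; if keywords are given,
    also require that the title contains at least one of them."""
    selected = []
    for entry in entries:
        if not (start <= entry_datetime(entry) <= end):
            continue
        title = entry["title"].lower()
        if keywords and not any(word.lower() in title for word in keywords):
            continue
        selected.append(entry)
    return selected


# Hypothetical entries shaped like fp['entries'] (only the fields used here).
sample_entries = [
    {"title": "Stocks slide as trade talks stall - Example Wire",
     "published_parsed": time.strptime("2019-05-08 09:30:00", "%Y-%m-%d %H:%M:%S")},
    {"title": "Local team wins opener - Example Wire",
     "published_parsed": time.strptime("2019-05-08 10:00:00", "%Y-%m-%d %H:%M:%S")},
    {"title": "Markets rally on earnings - Example Wire",
     "published_parsed": time.strptime("2019-05-09 08:00:00", "%Y-%m-%d %H:%M:%S")},
]

start = datetime(2019, 5, 8, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2019, 5, 8, 18, 17, 4, tzinfo=timezone.utc)

# Assumed keyword list; adjust it to whatever counts as financial news for you.
finance = select_entries(sample_entries, start, end, keywords=("stocks", "markets", "trade"))
for entry in finance:
    print(entry["title"])
```

With the real feed you would pass fp['entries'] instead of sample_entries, since each entry supports the same key lookups.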