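I'm new to Python programming. Very new. I'm trying to build my first web scraping project (for curating news stories).
I've managed to scrape a news site and written a loop that organizes the results the way I want. My problem is that I plan to scrape the page once a day, but only for articles published that day. I don't need all of them, because that would give me a lot of duplicates.
I know this has to do with converting the date via the datetime module (with an if statement), but for the life of me I haven't been able to find a way to make it work.
In the HTML, this is an example of how the date appears: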
<time datetime="2019-02-24T10:30:46+00:00">Feb 24, 2019 at 10:30</time>
This is what I have so far:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from datetime import datetime
my_url = "https://www.coindesk.com/category/business-news/legal"
# Opening up the website, grabbing the page
uFeedOne = uReq(my_url, timeout=5)
page_one = uFeedOne.read()
uFeedOne.close()
# html parser
page_soup1 = soup(page_one, "html.parser")
# grabs each publication block
containers = page_soup1.findAll("a", {"class": "stream-article"} )
for container in containers:
    link = container.attrs['href']
    publication_date = "published on " + container.time.text
    title = container.h3.text
    description = "(CoinDesk)-- " + container.p.text
    print("link: " + link)
    print("publication_date: " + publication_date)
    print("title: " + title)
    print("description: " + description)
Answer 0 (score: 0)
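The date is in ISO 8601 format. Extract the datetime attribute from the time tag as a string. If you are using Python 3.7, you can use the datetime.datetime.fromisoformat method to convert that string into a datetime object and then do the comparison. If you are on an older version of Python, I think the simplest approach is to look at this question and the first answer given there.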
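A minimal sketch of that approach, assuming Python 3.7 or later. The helper name is_published_on and the fixed sample dates are illustrative and not part of the original code; in the scraper, the string would come from container.time['datetime']. The sketch also assumes that "today" means the current UTC calendar day.

```python
from datetime import date, datetime, timezone


def is_published_on(iso_string, day):
    """Return True if the ISO 8601 timestamp falls on the given UTC calendar day.

    Hypothetical helper for illustration only.
    """
    published = datetime.fromisoformat(iso_string)
    # Normalise to UTC so both sides refer to the same time zone.
    return published.astimezone(timezone.utc).date() == day


sample = "2019-02-24T10:30:46+00:00"
print(is_published_on(sample, date(2019, 2, 24)))  # True
print(is_published_on(sample, date(2019, 2, 25)))  # False

# In a daily job, the comparison day would be the current UTC date:
today_utc = datetime.now(timezone.utc).date()
```

Comparing calendar dates rather than full timestamps avoids having to pick a cutoff hour, which may suit a once-a-day run.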
Answer 1 (score: 0)
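Your time tag has a datetime attribute, which gives a better representation of the date and time than the text does. Use it.
You can use the dateutil package to parse the string. Here is some sample code: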
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from datetime import datetime, timedelta
from dateutil import parser
import pytz
my_url = "https://www.coindesk.com/category/business-news/legal"
# Opening up the website, grabbing the page
uFeedOne = uReq(my_url, timeout=5)
page_one = uFeedOne.read()
uFeedOne.close()
# html parser
page_soup1 = soup(page_one, "html.parser")
# grabs each publication block
containers = page_soup1.findAll("a", {"class": "stream-article"} )
for container in containers:
    ## Cutoff date for the comparison.
    ## I have taken an offset as the site has older articles than today.
    ## datetime.now(pytz.UTC) returns a timezone-aware UTC datetime; a plain
    ## datetime.now() is "naive" and cannot be compared with the article date.
    today = datetime.now(pytz.UTC) - timedelta(days=5)
    link = container.attrs['href']
    ## The actual datetime string is in the datetime attribute of the time tag.
    date_time = container.time['datetime']
    ## We will use the dateutil package to parse the ISO-formatted date.
    ## The result is timezone-aware because the string carries a UTC offset.
    date = parser.parse(date_time)
    ## Simple comparison: keep only articles at or after the cutoff.
    if date >= today:
        print("article date", date)
        print("cutoff date", today, "\n")
        publication_date = "published on " + container.time.text
        title = container.h3.text.encode('utf-8')
        description = "(CoinDesk)-- " + container.p.text
        print("link: " + link)
        print("publication_date: " + publication_date)
        print("title: ", title)
        print("description: " + description)
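The comparison in the loop only works when both datetimes are timezone-aware. The sketch below shows why, using only the standard library and fixed sample values (the variable names are illustrative, not from the original code): Python refuses to order a naive datetime against an aware one, and attaching UTC to the cutoff makes the comparison valid.

```python
from datetime import datetime, timedelta, timezone

# Aware datetime, like the one parsed from the datetime attribute.
article_date = datetime(2019, 2, 24, 10, 30, 46, tzinfo=timezone.utc)
# Naive datetime with no tzinfo, like the result of a plain datetime.now().
naive_cutoff = datetime(2019, 2, 20)

try:
    article_date >= naive_cutoff
    comparable = True
except TypeError:
    # Ordering a naive datetime against an aware one raises TypeError.
    comparable = False
print(comparable)  # False

# Attaching UTC to the cutoff makes both sides aware.
aware_cutoff = naive_cutoff.replace(tzinfo=timezone.utc)
print(article_date >= aware_cutoff)  # True

# For a once-a-day run, an aware cutoff can be built directly:
cutoff_now = datetime.now(timezone.utc) - timedelta(days=1)
```

Building the cutoff with datetime.now(timezone.utc) keeps the stored time in UTC, whereas labelling a local-time value as UTC would shift the cutoff by the machine's UTC offset.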