I have an XML file that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<url>
<lastmod>2020-02-04T16:21:00+01:00</lastmod>
<loc>https://www.h.com</loc>
</url>
<url>
<lastmod>2020-01-31T17:17:00+01:00</lastmod>
<loc>https://www.h.com</loc>
</url>
<url>
<lastmod>2020-01-27T13:53:00+01:00</lastmod>
<loc>https://www.h.coml</loc>
</url>
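and a datetime.date like this:
datetime.date(2020, 2, 1)
Is it possible, using BeautifulSoup, to delete/ignore the content of a <url> tag if the date in its <lastmod> tag is earlier than the given datetime.date?
The result should look like this:
<?xml version="1.0" encoding="UTF-8"?>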
<url>
<lastmod>2020-02-04T16:21:00+01:00</lastmod>
<loc>https://www.h.com</loc>
</url>
Can anyone help?
Answer 0 (score: 1)
Would this work for you?
import time
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<?xml version="1.0" encoding="UTF-8"?>
<url>
<lastmod>2020-02-04T16:21:00+01:00</lastmod>
<loc>https://www.h.com</loc>
</url>
<url>
<lastmod>2020-01-31T17:17:00+01:00</lastmod>
<loc>https://www.h.com</loc>
</url>
<url>
<lastmod>2020-01-27T13:53:00+01:00</lastmod>
<loc>https://www.h.coml</loc>
</url>
'''
doc = SimplifiedDoc(html)
urls = doc.urls
startTime = time.strptime("2020-2-1", "%Y-%m-%d")
removeList = []
for url in urls:
    lastmod = url.lastmod.html  # Get lastmod
    # Drop the UTC offset ("+01:00") and parse the rest of the timestamp
    tm = time.strptime(lastmod[0:lastmod.find('+')], "%Y-%m-%dT%H:%M:%S")
    if tm < startTime:
        removeList.append(url)
n = len(removeList)
html = doc.html
while n > 0:  # Delete data in reverse order
    n -= 1
    url = removeList[n]
    html = html[0:url._start] + html[url._end:]  # Delete url data
print(html.strip())
Result:
<?xml version="1.0" encoding="UTF-8"?>
<url>
<lastmod>2020-02-04T16:21:00+01:00</lastmod>
<loc>https://www.h.com</loc>
</url>
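Since the question asks about BeautifulSoup specifically, the same filtering could be sketched with it roughly as follows. This is only a sketch under a few assumptions: beautifulsoup4 is installed, the built-in "html.parser" backend is good enough for this fragment, entries without a <lastmod> are kept, and the comparison uses the calendar date as written in the timestamp (the UTC offset is ignored). The names XML_TEXT and drop_old_urls are made up for illustration.

```python
from datetime import date, datetime

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

XML_TEXT = """<?xml version="1.0" encoding="UTF-8"?>
<url>
<lastmod>2020-02-04T16:21:00+01:00</lastmod>
<loc>https://www.h.com</loc>
</url>
<url>
<lastmod>2020-01-31T17:17:00+01:00</lastmod>
<loc>https://www.h.com</loc>
</url>
<url>
<lastmod>2020-01-27T13:53:00+01:00</lastmod>
<loc>https://www.h.coml</loc>
</url>
"""


def drop_old_urls(xml_text, cutoff):
    """Return a soup in which every <url> older than `cutoff` has been removed."""
    soup = BeautifulSoup(xml_text, "html.parser")
    # find_all returns a list-like snapshot, so removing nodes while looping is safe
    for url in soup.find_all("url"):
        lastmod = url.find("lastmod")
        if lastmod is None:
            continue  # assumption: keep entries that have no <lastmod>
        # Assumption: compare only the "YYYY-MM-DD" part, ignoring the UTC offset
        day_text = lastmod.get_text(strip=True)[:10]
        modified = datetime.strptime(day_text, "%Y-%m-%d").date()
        if modified < cutoff:
            url.decompose()  # remove the whole <url> element from the tree
    return soup


filtered = drop_old_urls(XML_TEXT, date(2020, 2, 1))
print(str(filtered).strip())
```

The removed elements leave their surrounding newlines behind, so the printed text may contain blank lines; only the first <url> entry should remain.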
Answer 1 (score: 0)
If you are using Python >= 3.7, you can convert the time string (named your_date_string here for convenience) to a datetime as follows, with datetime imported from the datetime module:
datetime.strptime(your_date_string, '%Y-%m-%dT%H:%M:%S%z')
On older Python versions, you need to remove the last colon from the time zone offset before parsing.
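The colon removal for older Python versions, and the comparison against the cutoff date from the question, could look roughly like this. It is a minimal sketch using only the standard library; the helper name parse_lastmod is hypothetical, and it assumes the offset always has the form +HH:MM or -HH:MM at the end of the string.

```python
import re
from datetime import date, datetime


def parse_lastmod(value):
    """Parse a timestamp such as '2020-01-31T17:17:00+01:00' into an aware datetime."""
    # Rewrite a trailing '+01:00' as '+0100' so that %z also accepts it
    # on Python versions older than 3.7 (newer versions accept both forms).
    normalized = re.sub(r"([+-]\d{2}):(\d{2})$", r"\1\2", value)
    return datetime.strptime(normalized, "%Y-%m-%dT%H:%M:%S%z")


cutoff = date(2020, 2, 1)
parsed = parse_lastmod("2020-01-31T17:17:00+01:00")
# .date() is the calendar date in the timestamp's own UTC offset
is_older = parsed.date() < cutoff
print(parsed.isoformat(), is_older)
```

Comparing parsed.date() with the datetime.date avoids mixing aware datetimes and plain dates, which Python does not allow in a direct comparison.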