Removing entries by date using BeautifulSoup

Time: 2020-02-06 16:51:32

Tags: python beautifulsoup

I have an XML file that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
  <url>
    <lastmod>2020-02-04T16:21:00+01:00</lastmod>
    <loc>https://www.h.com</loc>
  </url>
  <url>
    <lastmod>2020-01-31T17:17:00+01:00</lastmod>
    <loc>https://www.h.com</loc>
  </url>
  <url>
    <lastmod>2020-01-27T13:53:00+01:00</lastmod>
    <loc>https://www.h.coml</loc>
  </url>

And a datetime.date like this:

datetime.date(2020, 2, 1)

Is it possible to use BeautifulSoup to remove/ignore a <url> entry if the date in its <lastmod> tag is earlier than the given datetime.date?

The result would look like this:

<?xml version="1.0" encoding="UTF-8"?>
  <url>
    <lastmod>2020-02-04T16:21:00+01:00</lastmod>
    <loc>https://www.h.com</loc>
  </url>

Can anyone help?

2 answers:

Answer 0: (score: 1)

Does this work for you?

import time
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<?xml version="1.0" encoding="UTF-8"?>
<url>
  <lastmod>2020-02-04T16:21:00+01:00</lastmod>
  <loc>https://www.h.com</loc>
</url>
<url>
  <lastmod>2020-01-31T17:17:00+01:00</lastmod>
  <loc>https://www.h.com</loc>
</url>
<url>
  <lastmod>2020-01-27T13:53:00+01:00</lastmod>
  <loc>https://www.h.coml</loc>
</url>
'''
doc = SimplifiedDoc(html)
urls = doc.urls
startTime = time.strptime("2020-2-1", "%Y-%m-%d")
removeList=[]
for url in urls:
  lastmod = url.lastmod.html # Get lastmod
  tm = time.strptime(lastmod[0:lastmod.find('+')], "%Y-%m-%dT%H:%M:%S")
  if tm<startTime:
    removeList.append(url)
n = len(removeList)
html = doc.html
while n>0: # Delete data in reverse order
  n-=1
  url = removeList[n]
  html = html[0:url._start]+html[url._end:] # Delete url data
print (html.strip())

Result:

<?xml version="1.0" encoding="UTF-8"?>
<url>
  <lastmod>2020-02-04T16:21:00+01:00</lastmod>
  <loc>https://www.h.com</loc>
</url>
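
For comparison, since the question asks about BeautifulSoup specifically, here is a minimal sketch of the same filtering done with bs4. It assumes the beautifulsoup4 package is installed and Python 3.7+ (for datetime.fromisoformat); the variable names (xml, cutoff, remaining) are illustrative and not from the original posts. It uses the built-in html.parser so that no extra parser package is needed.

```python
from datetime import date, datetime

from bs4 import BeautifulSoup

xml = """<?xml version="1.0" encoding="UTF-8"?>
<url>
  <lastmod>2020-02-04T16:21:00+01:00</lastmod>
  <loc>https://www.h.com</loc>
</url>
<url>
  <lastmod>2020-01-31T17:17:00+01:00</lastmod>
  <loc>https://www.h.com</loc>
</url>
<url>
  <lastmod>2020-01-27T13:53:00+01:00</lastmod>
  <loc>https://www.h.coml</loc>
</url>
"""

# Entries last modified before this date are dropped (assumed cutoff).
cutoff = date(2020, 2, 1)

# html.parser ships with Python, so no extra parser package is required.
soup = BeautifulSoup(xml, "html.parser")

for url in soup.find_all("url"):
    lastmod = url.find("lastmod")
    if lastmod is None:
        continue  # keep entries that carry no date
    # fromisoformat (Python 3.7+) accepts the "+01:00" offset directly.
    modified = datetime.fromisoformat(lastmod.get_text(strip=True))
    if modified.date() < cutoff:
        url.decompose()  # remove the whole <url> element from the tree

# Collect the <loc> values of the entries that survived the filter.
remaining = [u.find("loc").get_text(strip=True) for u in soup.find_all("url")]
print(str(soup).strip())
```

decompose() removes the element from the parsed tree, so printing the soup afterwards gives the filtered document; blank lines left behind by the removed entries may remain in the output.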

Answer 1: (score: 0)

If you are using Python >= 3.7, you can convert the time string (called your_date_string here for convenience) to a datetime with:

datetime.strptime(your_date_string, '%Y-%m-%dT%H:%M:%S%z')

On older Python versions, you first need to remove the last colon from the time zone offset (for example, +01:00 becomes +0100).
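
The parsing described in this answer can be sketched with the standard library alone. The helper name parse_lastmod and the sample list stamps are hypothetical, introduced only for illustration. The sketch strips the colon from the offset before calling strptime, which should work both on Python 3.7+ and on older Python 3 versions.

```python
import re
from datetime import date, datetime


def parse_lastmod(value):
    """Parse a timestamp such as '2020-02-04T16:21:00+01:00' (hypothetical helper)."""
    # Drop the colon in a trailing UTC offset: '+01:00' -> '+0100'.
    normalized = re.sub(r"([+-]\d{2}):(\d{2})$", r"\1\2", value)
    return datetime.strptime(normalized, "%Y-%m-%dT%H:%M:%S%z")


# Assumed cutoff and sample values taken from the question's XML.
cutoff = date(2020, 2, 1)
stamps = [
    "2020-02-04T16:21:00+01:00",
    "2020-01-31T17:17:00+01:00",
    "2020-01-27T13:53:00+01:00",
]

# Keep only timestamps on or after the cutoff date.
kept = [s for s in stamps if parse_lastmod(s).date() >= cutoff]
print(kept)
```

Comparing on .date() uses the local calendar date of each timestamp; if the comparison should happen in a specific time zone, convert the datetime before taking the date.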