I am trying to get a time range out of a page's source code using BeautifulSoup in Python.
The lines I am trying to parse look like this:
<span class="experience-date-locale"><time>June 2010</time> – <time>August 2010</time> (3 months)<span class="locality">New York</span></span>
<span class="experience-date-locale"><time>October 2015</time> – Present (7 months)</span>
<span class="experience-date-locale"><time>May 2010</time> – <time>October 2011</time> (6 months)</span>
I can't figure out how to do this correctly.
The line below did not work for me, because sometimes the span also contains a 'locality' class:
soup.find('span', {'class': 'experience-date-locale'}).text
This does not work either, because I lose the 'Present' part:
soup.find('span', {'class': 'experience-date-locale'}).findAll('time').text
How can I exclude the locality part and get just the time range?
The result should be:
June 2010 - August 2010 (3 months)
October 2015 - present (7 months)
May 2010 - October 2011 (6 months)
Answer 0 (score: 1)
You can try removing the extra <span> tag.
from bs4 import BeautifulSoup

html = '''<span class="experience-date-locale"><time>June 2010</time> – <time>August 2010</time> (3 months)<span class="locality">New York</span></span>
<span class="experience-date-locale"><time>October 2015</time> – Present (7 months)</span>
<span class="experience-date-locale"><time>May 2010</time> – <time>October 2011</time> (6 months)</span>'''
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

for e in soup.find_all('span', {'class': 'experience-date-locale'}):
    # Detach the nested <span class="locality">, if present, before reading the text
    if e.span:
        e.span.extract()
    print(e.text)
Output
June 2010 – August 2010 (3 months)
October 2015 – Present (7 months)
May 2010 – October 2011 (6 months)
This gives the output you want; note, however, that it does modify the document tree.
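If the tree needs to stay intact, one option is to work on a copy of each span. The following is only a sketch under that assumption: `date_ranges` is a name made up for illustration, and it relies on `copy.copy()` producing a detached copy of a tag, which Beautiful Soup documents for version 4.4.0 and later.

```python
import copy
from bs4 import BeautifulSoup

html = '''<span class="experience-date-locale"><time>June 2010</time> – <time>August 2010</time> (3 months)<span class="locality">New York</span></span>
<span class="experience-date-locale"><time>October 2015</time> – Present (7 months)</span>
<span class="experience-date-locale"><time>May 2010</time> – <time>October 2011</time> (6 months)</span>'''

soup = BeautifulSoup(html, "html.parser")


def date_ranges(soup):
    """Hypothetical helper: collect the date text without touching the original tree."""
    results = []
    for e in soup.find_all("span", class_="experience-date-locale"):
        # Work on a detached copy so the original document keeps its locality span
        clone = copy.copy(e)
        for loc in clone.find_all("span", class_="locality"):
            loc.decompose()
        results.append(clone.get_text().strip())
    return results


ranges = date_ranges(soup)
for r in ranges:
    print(r)
```

After this runs, `soup` still contains the `locality` span, so later lookups against the same tree are unaffected.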
Answer 1 (score: 0)
Try this (it assumes soup has already been built from the HTML):
import bs4

for span in soup.findAll("span", {"class": "experience-date-locale"}):
    for child in span.contents:
        # Print the text of <time> tags and any bare text; skip other tags such as locality
        if isinstance(child, bs4.element.Tag) and child.name == "time":
            print(child.text, end='')
        elif isinstance(child, bs4.element.NavigableString):
            print(child, end='')
    print()
Output:
June 2010 – August 2010 (3 months)
October 2015 – Present (7 months)
May 2010 – October 2011 (6 months)
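The same idea can also be written so that it returns a string instead of printing, and skips the locality span by its class rather than keeping only time tags. This is a rough sketch, not a tested recipe for the real page: `range_text` is a hypothetical name, and it assumes the unwanted child always carries the class `locality`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def range_text(span):
    """Hypothetical helper: join the span's direct children, skipping locality tags."""
    parts = []
    for child in span.children:
        if isinstance(child, Tag):
            # Skip the nested <span class="locality">; keep every other tag's text
            if "locality" in child.get("class", []):
                continue
            parts.append(child.get_text())
        else:
            # Bare text nodes such as " – " or " (3 months)"
            parts.append(str(child))
    return "".join(parts).strip()


sample = ('<span class="experience-date-locale"><time>June 2010</time> – '
          '<time>August 2010</time> (3 months)<span class="locality">New York</span></span>')
sample_soup = BeautifulSoup(sample, "html.parser")
text = range_text(sample_soup.find("span", class_="experience-date-locale"))
print(text)
```

Filtering by class means any other inline tag that later appears inside the span is kept, which may or may not be what you want.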