Question

I'm trying to use BeautifulSoup in Python to extract a time range from a page's source code.

The lines I'm trying to parse look like this:

<span class="experience-date-locale"><time>June 2010</time> – <time>August 2010</time> (3 months)<span class="locality">New York</span></span>

<span class="experience-date-locale"><time>October 2015</time> – Present (7 months)</span>

<span class="experience-date-locale"><time>May 2010</time> – <time>October 2011</time> (6 months)</span>

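I can't figure out how to do this correctly.

This attempt doesn't work, because some entries also contain a nested span with the 'locality' class, and its text ends up in the result: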

soup.find('span', {'class': 'experience-date-locale'}).text

This one doesn't work either, because it drops the 'Present' part:

soup.find('span', {'class': 'experience-date-locale'}).findAll('time').text

How can I exclude the locality part and get only the time range?

The result should be:

June 2010 - August 2010 (3 months)

October 2015 - present (7 months)

May 2010 - October 2011 (6 months)

Answer 1

You can try removing the extra <span> tag.

from bs4 import BeautifulSoup

html = '''<span class="experience-date-locale"><time>June 2010</time> – <time>August 2010</time> (3 months)<span class="locality">New York</span></span>
<span class="experience-date-locale"><time>October 2015</time> – Present (7 months)</span>
<span class="experience-date-locale"><time>May 2010</time> – <time>October 2011</time> (6 months)</span>'''

# Name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(html, 'html.parser')
for e in soup.find_all('span', {'class': 'experience-date-locale'}):
    # e.span is the first nested <span> (the locality), if there is one.
    if e.span:
        # extract() detaches the nested tag from the tree and returns it.
        _ = e.span.extract()
    print(e.text)

Output

June 2010 – August 2010 (3 months)
October 2015 – Present (7 months)
May 2010 – October 2011 (6 months)

This gives the output you want, but note that it does modify the document tree.
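If changing the tree is a problem, one possible workaround is to strip the nested span from a copy of each tag instead. The sketch below assumes a bs4 version that supports `copy.copy()` on tags; `date_range_text` is a hypothetical helper name used only for illustration, not part of the library.

```python
import copy

from bs4 import BeautifulSoup

html = (
    '<span class="experience-date-locale"><time>June 2010</time> – '
    '<time>August 2010</time> (3 months)'
    '<span class="locality">New York</span></span>'
)

soup = BeautifulSoup(html, 'html.parser')


def date_range_text(tag):
    """Return the text of `tag` without any nested locality span.

    Hypothetical helper for illustration; it is not a bs4 function.
    """
    clone = copy.copy(tag)  # detached copy, so the original tree is untouched
    for nested in clone.find_all('span', class_='locality'):
        nested.decompose()  # remove the locality span from the copy only
    return clone.get_text()


span = soup.find('span', class_='experience-date-locale')
text = date_range_text(span)
print(text)
```

After the call, the original `span` still contains its locality child, so later lookups against `soup` keep working.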

Answer 2

Try this:

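import bs4

# `soup` is the BeautifulSoup object built from the page source.
for span in soup.find_all("span", {"class": "experience-date-locale"}):
    for child in span.contents:
        # Print <time> tags, but skip any other tag (such as the locality span).
        if isinstance(child, bs4.element.Tag) and child.name == "time":
            print(child.text, end='')
        # Bare text between tags, e.g. " – Present (7 months)".
        elif isinstance(child, bs4.element.NavigableString):
            print(child, end='')
    print()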

Output:

June 2010 – August 2010 (3 months)
October 2015 – Present (7 months)
May 2010 – October 2011 (6 months)
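The same child-walking idea can be wrapped in a function that returns the strings instead of printing them. This is only a sketch: `time_ranges` is a hypothetical name, and replacing the en dash with a plain hyphen is an assumption based on the hyphens shown in the question's expected result.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '''<span class="experience-date-locale"><time>June 2010</time> – <time>August 2010</time> (3 months)<span class="locality">New York</span></span>
<span class="experience-date-locale"><time>October 2015</time> – Present (7 months)</span>
<span class="experience-date-locale"><time>May 2010</time> – <time>October 2011</time> (6 months)</span>'''


def time_ranges(markup):
    """Collect one date-range string per experience-date-locale span.

    Hypothetical helper for illustration; it is not a bs4 function.
    """
    soup = BeautifulSoup(markup, 'html.parser')
    results = []
    for span in soup.find_all('span', class_='experience-date-locale'):
        parts = []
        for child in span.children:
            if isinstance(child, Tag):
                # Keep <time> tags; skip other tags such as the locality span.
                if child.name == 'time':
                    parts.append(child.get_text())
            elif isinstance(child, NavigableString):
                parts.append(str(child))  # bare text between the tags
        # Assumption: normalize the en dash (U+2013) to a plain hyphen.
        results.append(''.join(parts).strip().replace('\u2013', '-'))
    return results


ranges = time_ranges(html)
for line in ranges:
    print(line)
```

Because only direct children are inspected, the document tree is left unchanged and the nested locality text never enters the result.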

BeautifulSoup - find a class and exclude another class
