Question

I have the following HTML:

<div class="content">
    <div class="title">
        <a id="hlAdv" class="title" href="./sample.aspx">
            <font size=2>Pretty Beauty Fiesta -1st Avenue Mall!</font>
        </a>
    </div>
    19<sup>th</sup> ~ 21<sup>st</sup> Apr 2013
</div>

I am using Python and trying to extract the date with BeautifulSoup. The output I expect is:

19th ~ 21st Apr 2013

I tried:

find("div", {"class":"content"}).text

Output:

Pretty Beauty Fiesta -1st Avenue Mall!19th ~ 21st Apr 2013

And also:

find("div", {"class":"content"}).div.nextSibling

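Output:

I tried chaining more nextSibling calls to reach the rest of the content, but I still cannot get the "Apr 2013" part correctly.

How can I get the data I want? Thanks.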

Answer 1

Your problem is that you want all of the text that follows the inner div.

You want to use .next_siblings in a loop:

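# soup is the BeautifulSoup object parsed from the HTML in the question
content_div = soup.find('div', class_='content')
text = []
for elem in content_div.div.next_siblings:
    try:
        # tags: collect every string nested inside them
        text.extend(elem.strings)
    except AttributeError:
        # nodes without a .strings attribute are appended as-is
        text.append(elem)
# join without a separator so "19" and "th" stay together, then trim
text = ''.join(text).strip()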

.next_siblings is a generator that simply follows the chain of .next_sibling attributes, yielding every node along the way, including NavigableString objects.
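To make that concrete, here is a small sketch that lists what .next_siblings yields for the HTML in the question. It assumes bs4 with the built-in "html.parser"; the names HTML, inner_div, kinds and texts are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

# The HTML from the question, reproduced so the sketch is self-contained
HTML = """<div class="content">
    <div class="title">
        <a id="hlAdv" class="title" href="./sample.aspx">
            <font size=2>Pretty Beauty Fiesta -1st Avenue Mall!</font>
        </a>
    </div>
    19<sup>th</sup> ~ 21<sup>st</sup> Apr 2013
</div>"""

soup = BeautifulSoup(HTML, "html.parser")
inner_div = soup.find("div", class_="content").div

kinds = []  # "string" for text nodes, "tag" for elements such as <sup>
texts = []  # the text carried by each sibling
for node in inner_div.next_siblings:
    if isinstance(node, NavigableString):
        kinds.append("string")
        texts.append(str(node))
    else:
        kinds.append("tag")
        texts.append(node.get_text())

for kind, piece in zip(kinds, texts):
    print(kind, repr(piece))
```

Text nodes and <sup> tags alternate, which is why a single .nextSibling call only ever returns one fragment of the date.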

Result:

>>> ''.join(text).strip()
u'19th ~ 21st Apr 2013'

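How you handle whitespace here can be a little tricky; stripping after joining works best for this particular example, but for other documents, using elem.stripped_strings and elem.strip() on each piece could work just as well.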
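The difference between the two whitespace strategies can be seen side by side in the following sketch. It assumes bs4 with "html.parser"; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

# The HTML from the question, reproduced so the sketch is self-contained
HTML = """<div class="content">
    <div class="title">
        <a id="hlAdv" class="title" href="./sample.aspx">
            <font size=2>Pretty Beauty Fiesta -1st Avenue Mall!</font>
        </a>
    </div>
    19<sup>th</sup> ~ 21<sup>st</sup> Apr 2013
</div>"""

soup = BeautifulSoup(HTML, "html.parser")
inner_div = soup.find("div", class_="content").div

raw_parts = []       # pieces with their original whitespace
stripped_parts = []  # each piece stripped on its own
for node in inner_div.next_siblings:
    if isinstance(node, NavigableString):
        raw_parts.append(str(node))
        if node.strip():
            stripped_parts.append(node.strip())
    else:
        raw_parts.extend(node.strings)
        stripped_parts.extend(node.stripped_strings)

# Strategy 1: join first, strip once at the end
joined_then_stripped = "".join(raw_parts).strip()
# Strategy 2: strip every piece, then join with a space
stripped_then_joined = " ".join(stripped_parts)

print(joined_then_stripped)   # 19th ~ 21st Apr 2013
print(stripped_then_joined)   # 19 th ~ 21 st Apr 2013
```

For this markup the per-piece stripping separates "19" from "th", so joining first and stripping once is the better fit; markup where every piece is a separate word would favour the second strategy.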

Answer 2

How about this? It uses element.nextSiblingGenerator to walk over the elements that follow the div you care about, and skips the None at the end.

d = s.find('div', {'class':'content'}).div

def all_text_after(element):
    for item in element.nextSiblingGenerator():
        if not item:
            continue
        elif hasattr(item, 'contents'):
            for c in item.contents:
                yield c
        else:
            yield item

text_parts = list(all_text_after(d))
# -> [u'\n    19', u'th', u' ~ 21', u'st', u' Apr 2013\n']

print(''.join(text_parts))
# ->     19th ~ 21st Apr 2013
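For anyone on Beautiful Soup 4 and Python 3, a rough equivalent of the generator above might look like the sketch below. It is an adaptation rather than the answer's original code: it assumes bs4 with "html.parser", uses .next_siblings instead of nextSiblingGenerator, and calls get_text() on tags, so it collects all nested text rather than only the direct contents.

```python
from bs4 import BeautifulSoup, Tag

# The HTML from the question, reproduced so the sketch is self-contained
HTML = """<div class="content">
    <div class="title">
        <a id="hlAdv" class="title" href="./sample.aspx">
            <font size=2>Pretty Beauty Fiesta -1st Avenue Mall!</font>
        </a>
    </div>
    19<sup>th</sup> ~ 21<sup>st</sup> Apr 2013
</div>"""


def all_text_after(element):
    """Yield the text of every sibling that follows `element`."""
    for item in element.next_siblings:
        if isinstance(item, Tag):
            # a tag such as <sup>: take all of the text nested inside it
            yield item.get_text()
        else:
            # a bare text node
            yield str(item)


soup = BeautifulSoup(HTML, "html.parser")
d = soup.find("div", class_="content").div

text_parts = list(all_text_after(d))
result = "".join(text_parts).strip()
print(result)
```

Calling .strip() at the end drops the leading indentation and trailing newline that the unstripped version prints.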

Get the string from a <div> without the tags
