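How can I slice and recombine text based on nested tags with BeautifulSoup?

Asked: 2019-09-05 17:12:54

Tags: python python-3.x web-scraping beautifulsoup

In the HTML below, I need to read all of the text in order and assemble a separate sentence for each span class.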

<label for="01">"The traveler, with his powerful "
    <span class ="Wizard">"Storm"</span>
    <span class ="Warrior">"Whirlwind"</span>
    <span class ="Monk">"Prayer"</span>", took down the dark forces of evil. The "
    <span class ="Wizard">"wizard"</span>
    <span class ="Warrior">"warrior"</span>
    <span class ="Monk">"monk"</span>" was exhausted afterwards and needed to take a rest."
</label>

In this case there should be 3 separate sentences, each paired with its class in a list of lists, so the output should look like this:

[['Wizard', 'The traveler, with his powerful Storm, took down the dark forces of evil. The wizard was exhausted afterwards and needed to take a rest.'],
 ['Warrior', 'The traveler, with his powerful Whirlwind, took down the dark forces of evil. The warrior was exhausted afterwards and needed to take a rest.'],
 ['Monk', 'The traveler, with his powerful Prayer, took down the dark forces of evil. The monk was exhausted afterwards and needed to take a rest.']]

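I don't know how to approach this and couldn't find anything online, possibly because I'm not sure how to phrase the question (if you have suggestions for phrasing it better, please leave a comment and I will do so).

Thanks in advance!

Edit: I tried using find(text=True) and find_all(text=True), but I don't know how to proceed from there.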

1 answer:

Answer 0: (score: 0)

You can use itertools.groupby:

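import bs4
from bs4 import BeautifulSoup as soup
from itertools import groupby

# content is the HTML string from the question; skip the whitespace-only nodes between the spans
nodes = [x for x in soup(content, 'html.parser').label.contents if str(x).strip()]
# group consecutive nodes: a is True for a text fragment, False for a run of <span> tags
d = [(a, list(b)) for a, b in groupby(nodes, key=lambda x: isinstance(x, bs4.element.NavigableString))]
# users holds one tuple of spans per class; _text holds the shared fragments minus their opening quote
users, _text = list(zip(*[b for a, b in d if not a])), [b[0][1:] for a, b in d if a]
# put each class's span text (quotes stripped) after every fragment except the last
result = [[a[0]['class'][0], ''.join(t + s.text[1:-1] for t, s in zip(_text, a)) + _text[-1]] for a in users]

The output still contains the closing quotes and the \n characters that follow them; you can remove these with re:

import re
# remove each closing quote together with the newline and indentation after it
final_result = [[a, re.sub(r'"\n\s*', '', b)] for a, b in result]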

Output:

[['Wizard', 'The traveler, with his powerful Storm, took down the dark forces of evil. The wizard was exhausted afterwards and needed to take a rest.'], 
 ['Warrior', 'The traveler, with his powerful Whirlwind, took down the dark forces of evil. The warrior was exhausted afterwards and needed to take a rest.'], 
 ['Monk', 'The traveler, with his powerful Prayer, took down the dark forces of evil. The monk was exhausted afterwards and needed to take a rest.']]
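
For readers who find the one-liners hard to follow, the same idea can also be written as a plain loop. This is only a sketch under a few assumptions: the markup is stored in a variable called content, the label contains nothing but text and span tags, and every span has exactly one class. The helper name sentences_by_class is made up for this example and is not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# The markup from the question; the variable name `content` is an assumption.
content = '''<label for="01">"The traveler, with his powerful "
    <span class ="Wizard">"Storm"</span>
    <span class ="Warrior">"Whirlwind"</span>
    <span class ="Monk">"Prayer"</span>", took down the dark forces of evil. The "
    <span class ="Wizard">"wizard"</span>
    <span class ="Warrior">"warrior"</span>
    <span class ="Monk">"monk"</span>" was exhausted afterwards and needed to take a rest."
</label>'''


def sentences_by_class(html):
    """Hypothetical helper: return [[class_name, sentence], ...] for the first <label>."""
    label = BeautifulSoup(html, 'html.parser').label

    # Collect the class names in order of first appearance.
    classes = []
    for span in label.find_all('span'):
        name = span['class'][0]
        if name not in classes:
            classes.append(name)

    parts = {name: [] for name in classes}
    for node in label.children:
        if isinstance(node, Tag):
            # A span contributes only to the sentence of its own class.
            parts[node['class'][0]].append(node.get_text().strip().strip('"'))
        elif isinstance(node, NavigableString):
            # Shared text: drop the surrounding whitespace, then the literal quotes.
            text = node.strip().strip('"')
            if text:  # skip the whitespace-only nodes between spans
                for name in classes:
                    parts[name].append(text)

    return [[name, ''.join(parts[name])] for name in classes]


result = sentences_by_class(content)
for row in result:
    print(row)
```

Because the quotes and the surrounding whitespace are stripped node by node, this version does not need the separate re cleanup step, and it does not depend on how the markup is indented.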