Question

I'm trying to parse HTML into text with BeautifulSoup, but I've run into a problem: some words are separated only by tags, with no whitespace between them:

<span>word1</span><span>word2</span>

So when I extract the text, I get:

word1word2

Some sentences also get run together, for example a heading and the sentence that follows it:

INTRODUCTION There are many...

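Is there a simple way to force word separation at tag boundaries with BeautifulSoup? And is it possible to force sentence separation at certain tags as well?

I have several complex HTML files. I convert them to text like this: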

plain_texts = [BeautifulSoup(html, "html.parser").get_text() for html in htmls]

Answer 1

You can use find_all():

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html><html lang="en"><head><title>words</title></head><body><span>word1</span><span>word2</span></body></html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
for span in soup.find_all('span'):
    print(span.text)

This prints the text of each <span> tag on its own line:

word1
word2
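
If you want a single string back rather than one line per tag, the pieces can be joined with a space. This is only a minimal sketch, assuming all the text you care about sits inside non-nested <span> tags; the helper name spans_to_text is made up for illustration:

```python
from bs4 import BeautifulSoup


def spans_to_text(html):
    """Hypothetical helper: join the text of every <span> with single spaces."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text(strip=True) trims whitespace around each span's text.
    return " ".join(span.get_text(strip=True) for span in soup.find_all("span"))


print(spans_to_text("<span>word1</span><span>word2</span>"))
```

Note that find_all() also returns nested matches, so a <span> inside another <span> would contribute its text twice with this approach.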

Answer 2

You can patch the soup with the replace_with() method (see the Beautiful Soup docs). How well this works depends a lot on the structure of your HTML:

from bs4 import BeautifulSoup

data = '''
<html><body><span>word1</span><span>word2</span>
'''

soup = BeautifulSoup(data, 'lxml')
for span in soup.select('span'):
    span.replace_with(span.text + ' ')

print(soup.text.strip())

This prints:

word1 word2
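
If you would rather not modify the tree, get_text() also accepts a separator argument that is placed between the text fragments, plus strip=True to trim each fragment and drop the empty ones. A minimal sketch applied to the list comprehension from the question; the two entries in htmls are made-up sample inputs, not the asker's real files:

```python
from bs4 import BeautifulSoup

# Made-up sample inputs standing in for the real HTML files.
htmls = [
    "<span>word1</span><span>word2</span>",
    "<h1>INTRODUCTION</h1><p>There are many...</p>",
]

# separator=" " is inserted between every pair of text fragments;
# strip=True trims each fragment and skips whitespace-only ones.
plain_texts = [
    BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
    for html in htmls
]

for text in plain_texts:
    print(text)
```

Passing separator="\n" instead puts each fragment on its own line, which may be closer to what you want for headings and paragraphs. The trade-off is that the separator is applied at every tag boundary, so markup inside a word, such as wo<b>rd</b>, would also be split.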
