Question

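I'm trying to extract some text using BeautifulSoup, and I'm using the get_text() function for this purpose.

My problem is that the text contains <br/> tags, and I need to convert them to line breaks. How can I do this?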

Answer 1

You can do this using the BeautifulSoup object itself, or any element of it:

for br in soup.find_all("br"):
    br.replace_with("\n")
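
For completeness, here is a minimal end-to-end sketch of the loop above. The HTML string is made up for illustration, and I'm assuming the built-in "html.parser" backend:

```python
from bs4 import BeautifulSoup

# Made-up sample markup containing both <br/> and <br>.
html = "<p>first line<br/>second line<br>third line</p>"
soup = BeautifulSoup(html, "html.parser")

# Swap every <br> tag for a newline string, in place.
for br in soup.find_all("br"):
    br.replace_with("\n")

text = soup.get_text()
print(text)
```

After the loop, get_text() returns the three lines separated by newline characters.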

Answer 2

As the official doc says:

You can specify a string to be used to join the bits of text together: soup.get_text("\n")
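
One thing to watch with this approach: the separator is placed between every text fragment, not only where a <br> was, so inline tags such as <b> also produce line breaks. A minimal sketch with a made-up HTML string:

```python
from bs4 import BeautifulSoup

# Made-up sample: one inline <b> tag and one <br/>.
html = "<p>Hello <b>big</b> world<br/>second line</p>"
soup = BeautifulSoup(html, "html.parser")

# The separator is inserted between all text fragments.
joined = soup.get_text("\n")

# strip=True trims whitespace from each fragment before joining.
joined_stripped = soup.get_text("\n", strip=True)

print(repr(joined))
print(repr(joined_stripped))
```

Passing strip=True tidies the stray spaces, but the extra breaks around the <b> text remain, so this fits best when the markup has few inline tags.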

Answer 3

A regex should do the trick.

import re
# the optional "/" also matches the self-closing forms <br/> and <br />
s = re.sub(r'<br\s*/?>', '\n', yourTextHere)
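
As a rough sketch of the whole flow, the substitution can be run on the raw HTML before it is handed to BeautifulSoup. The sample string and the names raw_html and br_pattern are made up here, and the pattern is my own variant that accepts <br>, <br/> and <br /> in any letter case:

```python
import re
from bs4 import BeautifulSoup

# Made-up sample with several spellings of the tag.
raw_html = "<div>one<br>two<br/>three<BR />four</div>"

# Optional whitespace, optional slash, case-insensitive.
br_pattern = re.compile(r"<br\s*/?>", re.IGNORECASE)
with_newlines = br_pattern.sub("\n", raw_html)

# Parse afterwards so the remaining tags are still stripped properly.
text = BeautifulSoup(with_newlines, "html.parser").get_text()
print(text)
```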

Hope this helps!

Answer 4

Adding to Ian's and dividebyzero's posts/comments, you can do the following to efficiently filter/replace many tags at once:

for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")
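
Here is a small sketch of that loop on a made-up snippet whose listed tags are not nested inside one another (the sample HTML is my own, not from the question):

```python
from bs4 import BeautifulSoup

# Made-up sample: three sibling block tags, no nesting.
html = "<h3>Title</h3><p>First paragraph</p><div>Last block</div>"
soup = BeautifulSoup(html, "html.parser")

# Replace each listed tag with its own text plus a blank line.
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.replace_with(elem.text + "\n\n")

text = soup.get_text()
print(text)
```

Each block ends up followed by a blank line, so a final text.strip() may be handy to drop the trailing one.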

Answer 5

Rather than replacing the tags with \n, it may be better to just append a \n to the end of all the tags that matter.

To steal the list from @petezurich:

for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.append('\n')
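
Because the tags stay in the tree, this also copes with one listed tag nested inside another. A minimal sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up sample: a <br/> nested in a <p>, both nested in a <div>.
html = "<div><p>Second<br/>paragraph</p><p>Next</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Append a newline string as the last child of every listed tag.
for elem in soup.find_all(["a", "p", "div", "h3", "br"]):
    elem.append("\n")

text = soup.get_text()
print(repr(text))
```

The <br/> still contributes its own newline, and the closing <p> and <div> each add one more, which is why the result ends with two.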

Answer 6

If you call element.text, you will get the text without the br tags. You may need to define your own custom method for this purpose:

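from bs4 import BeautifulSoup

def clean_text(elem):
    # walk every node under elem in document order
    text = ''
    for e in elem.descendants:
        if isinstance(e, str):
            # text node: keep it, minus surrounding whitespace
            text += e.strip()
        elif e.name == 'br' or e.name == 'p':
            # emit a line break where a <br> or <p> starts
            text += '\n'
    return text

# get page content (request_response is assumed to be an HTTP response fetched earlier)
soup = BeautifulSoup(request_response.text, 'html.parser')
# get your target element
description_div = soup.select_one('.description-class')
# clean the data
print(clean_text(description_div))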
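
To try the function without a network request, here is a self-contained sketch that runs the same logic on an inline string. The sample HTML is made up, and I use find() with class_ instead of select_one() purely to keep the example minimal:

```python
from bs4 import BeautifulSoup

def clean_text(elem):
    """Same logic as above: text nodes stripped, newline at each <br> or <p>."""
    parts = []
    for e in elem.descendants:
        if isinstance(e, str):
            parts.append(e.strip())
        elif e.name in ("br", "p"):
            parts.append("\n")
    return "".join(parts)

# Made-up sample markup standing in for a fetched page.
html = '<div class="description-class">Line one<br/>Line two<p>New paragraph</p></div>'
soup = BeautifulSoup(html, "html.parser")
description_div = soup.find("div", class_="description-class")

cleaned = clean_text(description_div)
print(cleaned)
```

Note that strip() is applied to every text node, so spaces next to inline tags are dropped too: "Hello <b>world</b>" would come out as "Helloworld".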

Convert <br/> to end line
