The emails I receive look like a big blob of HTML:
<p>Something something</p>
<p>Something else</p>
<a href="www.blahblah.com">Link</a>
So when I extract the text with BeautifulSoup, I get the following:
Something something
Something else
Link
...but I want to get rid of the indentation. I am trying to use textwrap.dedent, but it does not change the result. Also, how do I keep the link?
Current code:
no_html_message = BeautifulSoup(message).get_text()
formatted_message = textwrap.dedent(no_html_message)
Update: running print repr(no_html_message) shows that every indented line has actual spaces in front of it, i.e.:
\r\n content
Answer 0 (score: 0)
With your example code, the indentation looks fine when printed:
html = """
<p>Something something</p>
<p>Something else</p>
<a href="www.blahblah.com">Link</a>
"""
soup = BeautifulSoup(html)
print(soup.get_text())
Something something
Something else
Link
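But if the text has leading or trailing whitespace, just use strip (it trims only the start and end of the whole string):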
BeautifulSoup(html).get_text().strip()
html = """
<p> Something something</p>
<p>Something else</p>
<a href="www.blahblah.com">Link</a>
"""
soup = BeautifulSoup(html)
print(soup.get_text())
Something something # whitespace
Something else
Link
print(soup.get_text().strip())  # no whitespace
Something something
Something else
Link
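Note that strip() only touches the two ends of the string, so indentation on the inner lines survives. Similarly, textwrap.dedent only removes whitespace that is common to every line, so it does nothing if even one non-blank line starts at column zero. A minimal sketch of stripping each line individually, assuming the text looks like the repr() output in the question (raw_text and clean_text are illustrative names, not from the original code):

```python
# Sample input modeled on the repr() shown in the question ("\r\n   content").
raw_text = "\r\n   Something something\r\n   Something else\r\n   Link\r\n"

# str.strip() only trims the start and end of the whole string,
# so the inner lines keep their leading spaces.
print(repr(raw_text.strip()))

# Strip every line on its own, then drop the lines that end up empty.
# splitlines() handles both "\n" and "\r\n" line endings.
lines = [line.strip() for line in raw_text.splitlines()]
clean_text = "\n".join(line for line in lines if line)
print(repr(clean_text))
```

If blank lines between paragraphs should be preserved, the `if line` filter can be dropped.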
Answer 1 (score: 0)
One solution is to replace every newline followed by spaces with a plain newline.
import re
...
...
...
actual_text = soup.get_text()
# Collapse "newline + any run of spaces" into a bare newline.
unindented_text = re.sub(r'\n[ ]*', '\n', actual_text)
print(unindented_text)
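As for the second part of the question, keeping the link: get_text() drops the href because it only returns text nodes. One possible approach, shown here as a sketch rather than a tested solution for real email HTML, is to replace each <a> tag with its text plus its href before extracting the text. The "text (href)" output format, the choice of html.parser, and the variable names are assumptions made for illustration:

```python
from bs4 import BeautifulSoup

html = """
  <p>  Something something</p>
  <p>Something else</p>
  <a href="www.blahblah.com">Link</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Swap every <a> tag for a plain string "text (href)" so the URL
# is still present once the markup is stripped.
for a in soup.find_all("a", href=True):
    a.replace_with("{} ({})".format(a.get_text(), a["href"]))

# Strip each line individually and drop the empty ones.
lines = (line.strip() for line in soup.get_text().splitlines())
clean_text = "\n".join(line for line in lines if line)
print(clean_text)
```

Anchors without an href attribute are left untouched by the href=True filter, so only their text would appear in the result.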