Question

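I have a task: get the raw text from an HTML page. After parsing the HTML, I get a string containing a lot of '\n' symbols. When I try to replace them with an empty string, the replace function does not work. Here is my code: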

from bs4 import BeautifulSoup
import urllib.request
with urllib.request.urlopen('http://shakespeare.mit.edu/lear/full.html') as response:
    lear_bytes = response.read()
lear_html = str(lear_bytes)
soup = BeautifulSoup(lear_html, 'html.parser')
lear_txt_dirty = soup.get_text()
lear_txt_clean = str.replace(lear_txt_dirty, '\n', '')
print(lear_txt_clean)

Answer 1

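When sorting out string problems, it helps to print the repr of the string so you can see what it really contains. Replace the print call with the following: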

#print(lear_txt_clean)
print("Num newlines", lear_txt_clean.count('\n'))
print(repr(lear_txt_clean[:80]))

I get

Num newlines 0
"b'\\n \\n \\n King Lear: Entire Play\\n \\n \\n \\n \\n \\n\\n\\nKing Lear\\n\\n      Shakesp"

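You are working with the Python representation of a bytes object, not the real text. In your code, lear_bytes is a bytes object, but lear_html = str(lear_bytes) does not decode it; it gives you the Python representation of the bytes object, in which each newline appears as the two characters backslash and n. There are therefore no real newline characters for replace to find. Instead, give BeautifulSoup the raw bytes and let it sort out the decoding: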

from bs4 import BeautifulSoup
import urllib.request
with urllib.request.urlopen('http://shakespeare.mit.edu/lear/full.html') as response:
    # Pass the raw bytes; BeautifulSoup detects the encoding and decodes them
    soup = BeautifulSoup(response.read(), 'html.parser')
lear_txt_dirty = soup.get_text()
# The text now contains real newline characters, so replace finds them
lear_txt_clean = lear_txt_dirty.replace('\n', '')
print(lear_txt_clean[:80])
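
The difference between calling str() on a bytes object and decoding it can be reproduced without any network access. This is only a minimal sketch: the sample byte string `raw` is made up for illustration (it is not the actual page content), and UTF-8 is an assumed encoding.

```python
# Hypothetical sample data standing in for response.read()
raw = b"King Lear\n\nAct 1\n"

# str() on bytes returns the printable representation, starting with b'
# Each newline becomes two characters: a backslash followed by n
as_repr = str(raw)

# decode() turns the bytes into real text with real newline characters
# (UTF-8 is an assumption here; a real page may use another encoding)
as_text = raw.decode("utf-8")

print(repr(as_repr))
print(repr(as_text))

# replace has nothing to match in the representation...
print("newlines in str(raw):", as_repr.count("\n"))
# ...but works as expected on the decoded text
print("newlines in decoded text:", as_text.count("\n"))

cleaned = as_text.replace("\n", "")
print(repr(cleaned))
```

If you do need the text as a str before parsing, decoding explicitly like this is an alternative to handing the bytes to BeautifulSoup, provided you know the page's encoding.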
