Question

I have the HTML source code of a website:

from urllib.request import urlopen
url = 'http://...'
html = str(urlopen(url).read())

Then I save it to a file like this:

with open('/file/path', 'w') as f:
    f.write(html)

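When I do this, the newlines in the source code are replaced with '\r\n'. I would like to know how to remove these characters or replace them with what they stand for (newlines, tabs, etc.).

I tried html.replace('\r\n', '\n'), but it did not work.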

Answer 1

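The read() method of http.client.HTTPResponse (the object returned by urllib.request.urlopen) returns a bytes object. You cannot simply convert it to a string with str(your_bytes_object), because that turns \r\n (which prints as a line break) into \\r\\n (which prints as the literal characters \r\n rather than a line break):

>>> a_bytes_object = b'This is a test\r\nMore test'
>>> str(a_bytes_object)
"b'This is a test\\r\\nMore test'"
>>> print(str(a_bytes_object))
b'This is a test\r\nMore test'

Instead, you have to decode the bytes object with bytes.decode(your_encoding). latin-1 is commonly used as the encoding if you only need a string to write to a file, since it maps every possible byte to a character and so never fails to decode:

>>> a_bytes_object.decode("latin-1")
'This is a test\r\nMore test'
>>> print(a_bytes_object.decode("latin-1"))
This is a test
More test

You can also pass the encoding to str as the second argument instead of calling decode, i.e. str(a_bytes_object, "latin-1") instead of a_bytes_object.decode("latin-1").

Alternatively, you can simply open the file in binary mode (open('/file/path', 'wb')) and write the bytes object to it. Here html must be the bytes returned by read(), not the result of str():

with open('/file/path', 'wb') as f:
    f.write(html)

You can also try to read the Content-Type header (a value of the form text/html; charset=...) to extract the charset and then decode with the correct encoding, but this is risky because it does not always work (not all servers send the header, not all headers include a charset, and Python does not support every encoding).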
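The header-based approach could be sketched roughly as follows. The helper name decode_body and the latin-1 fallback are assumptions made for illustration, not part of the answer above; the charset is parsed with the standard library's email.message.Message, which is the same machinery behind response.headers on an HTTPResponse.

```python
from email.message import Message


def decode_body(raw, content_type, fallback="latin-1"):
    """Decode raw bytes using the charset named in a Content-Type value.

    decode_body is a hypothetical helper; fallback="latin-1" is an assumption.
    """
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    # None when the header is absent or carries no charset parameter
    charset = msg.get_content_charset()
    try:
        return raw.decode(charset or fallback)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name, or bytes that do not match the declared charset
        return raw.decode(fallback)
```

With a live response this would be called as decode_body(response.read(), response.headers.get("Content-Type")). The fallback covers the failure cases listed above, but the text it produces may be wrong for pages that are not actually latin-1.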

Answer 2

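I think you are treating replace as if it modified the string in place, rather than returning a new value that you need to assign to a variable.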

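from urllib.request import urlopen
url = 'http://www.google.com'
# Decode the bytes rather than calling str() on them, which would leave a
# literal backslash-r in the text. latin-1 is an assumption; it accepts any byte.
html = urlopen(url).read().decode('latin-1')

# replace() returns a new string, so keep the result
html_2 = html.replace('\r', '')

with open('/file/path/filename.txt', 'w', encoding='latin-1') as f:
    f.write(html_2)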
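A small offline sketch of both points may help. The raw value below is a made-up stand-in for what urlopen(url).read() returns, and latin-1 is only an assumed encoding.

```python
# Stand-in for urlopen(url).read(); the content is invented for illustration
raw = b"line one\r\nline two\r\n"

text = raw.decode("latin-1")

# Strings are immutable: calling replace() without keeping the result
# leaves text exactly as it was
text.replace("\r\n", "\n")
unchanged = "\r\n" in text

# Assign the returned string to keep the change
cleaned = text.replace("\r\n", "\n")

# str(raw) holds a backslash followed by the letter r, not a carriage
# return, so there is nothing for replace('\r\n', '\n') to match
as_repr = str(raw)
has_real_cr = "\r" in as_repr
has_escaped_cr = "\\r\\n" in as_repr
```

This is why the replace call in the question appeared to do nothing: the string produced by str() never contained real carriage returns in the first place.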

Replacing special characters (\n, \r\n, etc.)
