Question

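I'm trying to save the contents of a URL to a text file. I found several sample scripts online for this. The two below looked like good candidates, but both fail with this error:

TypeError: a bytes-like object is required, not 'str'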

import html2text
import urllib.request

with urllib.request.urlopen("http://www.msnbc.com") as r:
    html_content = r.read()
rendered_content = html2text.html2text(html_content)
file = open('C:\\Users\\Excel\\Desktop\\URL.txt', 'w')
file.write(rendered_content)
file.close()



import sys
if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    # Not Python 3 - today, it is most likely to be Python 2
    # But note that this might need an update when Python 4
    # might be around one day
    from urllib import urlopen
# Your code where you can use urlopen
with urlopen("http://www.msnbc.com") as r:
    s = r.read()
rendered_content = html2text.html2text(html_content)
file = open('C:\\Users\\Excel\\Desktop\\URL.txt', 'w')
file.write(rendered_content)
file.close()

I'm probably missing something simple here, but I can't tell what it is. Can someone help me? Thanks.

BTW, I'm using Python 3.6.

Answer 1

You need to call .decode('utf-8') on the bytes returned by read():

with urlopen("http://www.msnbc.com") as r:
    s = r.read().decode('utf-8')

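Without the decode call, the variable s holds a bytes object, which needs to be decoded into a str. The error comes from the distinction between unicode strings and bytes: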

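Python 3's standard string type is Unicode based, and Python 3 adds a dedicated bytes type, but critically, no automatic coercion between bytes and unicode strings is provided. The closest the language gets to implicit coercion are a few text-based APIs that assume a default encoding (usually UTF-8) if no encoding is explicitly stated. Thus, the core interpreter, its I/O libraries, module names, etc. are clear in their distinction between unicode strings and bytes. Python 3's unicode support even extends to the filesystem, so that non-ASCII file names are natively supported.

This string/bytes clarity is often a source of difficulty in transitioning existing code to Python 3, because many third party libraries and applications are themselves ambiguous in this distinction. Once migrated though, most UnicodeErrors can be eliminated.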

Source: https://www.python.org/dev/peps/pep-0404/#strings-and-bytes
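
The distinction described in the quoted passage can be reproduced without any network access. This is only a minimal sketch: the sample bytes and the helper name to_text are made up for illustration, and UTF-8 is an assumption about the page's encoding rather than something the site guarantees.

```python
# Hypothetical stand-in for what r.read() returns: raw bytes, not text.
raw = "<p>caf\u00e9</p>".encode("utf-8")


def to_text(data, encoding="utf-8"):
    """Decode bytes into str (illustrative helper, not part of any library).

    errors="replace" substitutes U+FFFD for undecodable bytes instead of
    raising UnicodeDecodeError when the assumed encoding is wrong.
    """
    return data.decode(encoding, errors="replace")


# Mixing bytes and str fails, because Python 3 never coerces between them.
try:
    raw.replace("<p>", "")  # str arguments passed to a bytes method
    mixed_ok = True
except TypeError:
    mixed_ok = False

# After decoding, ordinary str operations work as expected.
text = to_text(raw)
cleaned = text.replace("<p>", "").replace("</p>", "")
```

Calling a bytes method with str arguments is the same kind of mismatch that occurs when undecoded page content is handed to a function that expects text.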

Answer 2

Try:

str(content, encoding = "utf-8")

In your code, that becomes:

rendered_content = html2text.html2text(str(html_content, encoding = "utf-8"))
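
As a rough check, str(..., encoding=...) and .decode(...) can be compared side by side. The sketch below uses made-up sample bytes in place of a real download and a temporary file in place of the real path. It also passes encoding="utf-8" to open(), which is an addition not in the original code: without it, open() falls back to a locale-dependent default that may not be UTF-8 on Windows.

```python
import os
import tempfile

# Hypothetical sample standing in for the bytes returned by r.read().
html_content = "<h1>Hello, w\u00f6rld</h1>".encode("utf-8")

# Two equivalent ways to turn bytes into str.
via_str = str(html_content, encoding="utf-8")
via_decode = html_content.decode("utf-8")

# Placeholder path in a temporary directory; substitute the real target file.
path = os.path.join(tempfile.mkdtemp(), "URL.txt")

# An explicit encoding keeps the write independent of the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(via_str)

# Read the file back to confirm the text survived the round trip.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

Using a with block for the file also closes it automatically, so the explicit file.close() call is no longer needed.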

Saving the contents of a URL to a text file