Question

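I wrote a script to scrape the titles from a YouTube playlist page.

Judging by the print statements, everything works until I try to write the titles to a text file, at which point I get "UnicodeEncodeError: 'charmap' codec can't encode character(s)..."

I tried adding "encoding='utf8'" when opening the file. That fixed the error, but all the Chinese characters were replaced by random garbage characters.

I also tried encoding the output string with 'replace' and then decoding it, but that just replaces all the special characters with question marks.

Here is my code: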

from bs4 import BeautifulSoup as BS
import urllib.request
import re

playlist_url = input("gib nem: ")

with urllib.request.urlopen(playlist_url) as response:
  playlist = response.read().decode('utf-8')
  soup = BS(playlist, "lxml")

title_attrs = soup.find_all(attrs={"data-title":re.compile(r".*")})
titles = [tag["data-title"] for tag in title_attrs]

titles_str = '\n'.join(titles)#.encode('cp1252','replace').decode('cp1252')

print(titles_str)
with open("playListNames.txt", "a") as f:
    f.write(titles_str)

Here is a sample playlist I have been using for testing: https://www.youtube.com/playlist?list=PL3oW2tjiIxvSk0WKXaEiDY78KKbKghOOo

Answer 1

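The documentation is explicit about the file encoding:

encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.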

To answer the question in your last comment:

You can find out the preferred encoding on Windows with
```
import locale
locale.getpreferredencoding()
```

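If playListNames.txt is created with open('playListNames.txt', 'w'), the value returned by locale.getpreferredencoding() is used as the encoding.

If the file was created manually, its encoding depends on the editor's default/preferred encoding.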
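As a rough illustration of the point above, the file object itself reports which encoding it ended up with. This is a minimal sketch, not part of the original answer: the temporary path is only for demonstration, and the default value printed depends on the platform and Python version (for example cp1252 on a US Windows install).

```python
import locale
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "playListNames.txt")

# The platform default that open() falls back to when encoding is omitted.
print("preferred:", locale.getpreferredencoding(False))

# No encoding argument: the platform default is used.
with open(path, "w") as f:
    default_encoding = f.encoding

# Explicit encoding: the result no longer depends on the locale.
with open(path, "w", encoding="utf-8") as f:
    explicit_encoding = f.encoding

print("default: ", default_encoding)
print("explicit:", explicit_encoding)
```

If the "default" line shows something like cp1252, that is the codec raising the 'charmap' error in the question.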

See How to convert a file to utf-8 in Python? or How do I convert an ANSI encoded file to UTF-8 with Notepad++? [closed].
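For reference, converting an existing file usually comes down to reading it with its current encoding and writing it back as UTF-8. The sketch below assumes the source file is cp1252 (the usual "ANSI" code page on US Windows); the helper name convert_to_utf8 and the file names are made up for illustration and are not taken from the linked questions.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="cp1252"):
    """Re-encode a text file to UTF-8 (hypothetical helper)."""
    # Decode with the encoding the file was actually written in.
    with open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()
    # Write the same text back out as UTF-8.
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)


# Demonstration with a scratch directory and a small cp1252 file.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "ansi.txt")
dst_path = os.path.join(workdir, "utf8.txt")

sample = "café, naïve, €"
with open(src_path, "w", encoding="cp1252") as f:
    f.write(sample)

convert_to_utf8(src_path, dst_path)

with open(dst_path, "rb") as f:
    converted_bytes = f.read()

print(converted_bytes == sample.encode("utf-8"))
```

Note that this only helps for characters the original encoding could represent. Characters that were already written out as question marks cannot be recovered this way.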

Answer 2

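Specifying an encoding will fix your problem. Windows defaults to an ANSI encoding, which on US Windows is Windows-1252. It does not support Chinese. You should use utf8 or utf-8-sig as the encoding. Some Windows editors prefer the latter and otherwise assume the file is ANSI.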

with open('playListNames.txt', 'w', encoding='utf-8-sig') as f:
    f.write(titles_str)
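A small self-contained sketch of what is going on, under the assumption that the editor used to view the file was decoding it as Windows-1252. The sample titles and the temporary path are invented for the demonstration.

```python
import os
import tempfile

titles_str = "中文标题\nEnglish title"

# cp1252 has no Chinese characters, so encoding fails: this is the
# 'charmap' UnicodeEncodeError from the question.
try:
    titles_str.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False
print("cp1252 can encode the titles:", cp1252_ok)

# UTF-8 bytes misread as cp1252 show up as garbage characters, which is
# likely what the editor displayed after switching to encoding='utf8'.
mojibake = titles_str.encode("utf-8").decode("cp1252", errors="replace")
print("misread as cp1252:", repr(mojibake))

# utf-8-sig writes the same UTF-8 bytes preceded by a byte order mark,
# which lets BOM-aware Windows editors detect the encoding.
path = os.path.join(tempfile.mkdtemp(), "playListNames.txt")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(titles_str)

with open(path, "rb") as f:
    has_bom = f.read().startswith(b"\xef\xbb\xbf")
print("file starts with a BOM:", has_bom)

# Reading with utf-8-sig strips the BOM again.
with open(path, encoding="utf-8-sig") as f:
    round_trip = f.read()
print("round trip matches:", round_trip == titles_str)
```

So the file written with encoding='utf8' was most likely correct all along; the garbage came from the viewer guessing the wrong encoding.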

How do I write Chinese and English characters to a file (Python 3)?
