Question

I'm new to Python scripting, and I'm stuck on what should be a very simple task. All I want to do is read data from a .txt file and parse it.

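Steps I've taken

I downloaded a PDF file containing a list of courses from my school's website: http://info.sjsu.edu/cgi-bin/pdfserv?ftok=soc-fall-courses
I converted the PDF to a .txt file simply by saving it as a .txt file
I Googled the error and found that it is some kind of encoding problem
I ran the terminal command file -I [filename], which returned sjsuclassdata.txt: text/plain; charset=unknown-8bit
I tried several methods found online to convert the file to UTF-8 encoding, but to no avail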

The error message I'm receiving

Traceback (most recent call last):
  File "/Users/edward/MyPythonScripts/sjsuClassExtractor.py", line 25, in <module>
    regexMatches = lectureRegex.findall(file.read())
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 9: invalid continuation byte

As you can see, I'm really lost as to what to do from here. I have confirmed that everything works if I read a different file containing similar data.

Answer 1

Assuming the original text file is ANSI-encoded (the default for Acrobat Reader's "Save as Text" option), this command will convert it to UTF-8:

iconv -f "iso-8859-1" -t "utf-8" sjsuclassdata.txt -o sjsuclassdata-utf8.txt
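
The same conversion can also be sketched in Python itself, which avoids depending on a particular iconv build. This is only a minimal sketch under the same assumption as the command above, namely that the file is really ISO-8859-1; the function name convert_to_utf8 is made up for illustration. Note that ISO-8859-1 maps every possible byte to a character, so decoding never raises an error; if the file was actually saved in some other 8-bit encoding, the script will run but some characters will come out wrong.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Re-encode a text file as UTF-8 and return the decoded text.

    convert_to_utf8 is a hypothetical helper name. The default
    source_encoding is an assumption taken from the answer above;
    change it if the file was saved in a different encoding.
    """
    # Decode the raw bytes with the assumed source encoding.
    text = Path(src).read_text(encoding=source_encoding)
    # Write the same text back out, this time encoded as UTF-8.
    Path(dst).write_text(text, encoding="utf-8")
    return text
```

For example, convert_to_utf8("sjsuclassdata.txt", "sjsuclassdata-utf8.txt") would produce a file that open() can read with its default UTF-8 decoding. Alternatively, the original script could skip the conversion and open the file with open("sjsuclassdata.txt", encoding="iso-8859-1") before calling file.read().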
