Question

I am trying to read data from a .txt file, then find the 30 most common words and print them out. However, whenever I read my txt file, I get this error:

"UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 338: ordinal not in range(128)".

Here is my code:

filename = 'wh_2015_national_security_strategy_obama.txt'
#catches the year named in the file
year = filename[3:7]
ecount = 30
#opens the file and reads it
file = open(filename,'r').read()   #THIS IS WHERE THE ERROR IS
#counts the characters, then counts the lines, removes the non-word characters, splits the text into a list and changes it all to lower case.
numchar = len(file)
numlines = file.count('\n')
file = file.replace(",","").replace("'s","").replace("-","").replace(")","")
words = file.lower().split()
dictionary = {}
#this is a set of all the words to not count among the most commonly used.
dontcount = {"the", "of", "in", "to", "a", "and", "that", "we", "our", "is", "for", "at", "on", "as", "by", "be", "are", "will","this", "with", "or",
             "an", "-", "not", "than", "you", "your", "but","it","a","and", "i", "if","they","these","has","been","about","its","his","no",
             "because","when","would","was", "have", "their","all","should","from","most", "were","such","he", "very","which","may","because","--------",
             "had", "only", "no", "one", "--------", "any", "had", "other", "those", "us", "while",
             "..........", "*", "$", "so", "now","what", "who", "my","can", "who","do","could", "over", "-",
             "...............","................", "during","make","************",
             "......................................................................", "get", "how", "after",
             "..................................................", "...........................", "much", "some",
             "through","though","therefore","since","many", "then", "there", "–", "both", "them", "well", "me", "even", "also", "however"}
for w in words:
    if not w in dontcount:
        if w in dictionary:
            dictionary[w] +=1
        else:
            dictionary[w] = 1
num_words = sum(dictionary[w] for w in dictionary)
#This sorts the dictionary and makes it so that the most popular is at the top.
x = [(dictionary[w],w) for w in dictionary]
x.sort()
x.reverse()
#This prints out the number of characters, lines, and words (not including stop words).
print(str(filename))
print('The file has ',numchar,' number of characters.')
print('The file has ',numlines,' number of lines.')
print('The file has ',num_words,' number of words.')
#This provides the structure for how the most common words should be printed out
i = 1
for count, word in x[:ecount]:
    print("{0}, {1}, {2}".format(i,count,word))
    i+=1

Answer 1

In Python 3, when a file is opened in text mode (the default), Python uses your environment settings to choose an appropriate encoding.

If it cannot work one out (or your environment specifically defines ASCII), it will use ASCII. That is what is happening in your case.
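To check which encoding your own environment hands to Python, you can ask the `locale` module. This is only a minimal sketch, and the value it prints depends entirely on the machine it runs on:

```python
import locale

# Encoding that open() normally uses in text mode when no encoding= is given.
# Passing False only reads the current settings; it does not call setlocale().
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If this prints something like `ANSI_X3.4-1968` or `US-ASCII`, the environment is defining ASCII, which matches the error in the question.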

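If the ASCII decoder finds anything that is not ASCII, it raises an error. In your case, it raises an error on byte 0x92. That byte is not valid ASCII, and on its own it is not valid UTF-8 either. It does, however, make sense in the windows-1252 encoding, where it is ’ (a smart quote, 'RIGHT SINGLE QUOTATION MARK'). It could also make sense in other 8-bit code pages, but you would have to know that or work it out yourself.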
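The behaviour described above can be checked directly on the single byte from the error message. A small sketch (the variable names are just for illustration):

```python
raw = b"\x92"  # the byte named in the error message

# ASCII and UTF-8 both reject a lone 0x92 byte.
for codec in ("ascii", "utf-8"):
    try:
        raw.decode(codec)
    except UnicodeDecodeError as exc:
        print(codec, "failed:", exc.reason)

# windows-1252 maps 0x92 to U+2019, RIGHT SINGLE QUOTATION MARK.
quote = raw.decode("windows-1252")
print(repr(quote), hex(ord(quote)))
```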

To make your code read windows-1252 encoded files, change your open() call to:

file = open(filename, 'r', encoding='windows-1252').read()
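A slightly more defensive variant of the same idea wraps the read in a `with` block so the file is always closed. The helper name `read_text` and the demo file contents are made up for illustration, and windows-1252 is assumed to be the file's real encoding:

```python
import os
import tempfile


def read_text(path, encoding="windows-1252"):
    """Read a whole text file, decoding it with the given encoding."""
    # The with-block closes the file even if decoding fails.
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Demo: write bytes containing 0x92 to a temporary file, then read them back.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"America\x92s strategy")
    demo_path = tmp.name

text = read_text(demo_path)
os.remove(demo_path)
print(text)
```

If you do not know the real encoding, open() also accepts errors='replace', which substitutes a placeholder character instead of raising, at the cost of losing the original character.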

Answer 2

I am still learning Python, so please keep that in mind when reading this answer.

file = open(filename,'r').read()   #THIS IS WHERE THE ERROR IS

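From what I have learned so far, your read is combined with the creation of the open() object. The open() function creates the file handle, and the read() function reads the file into a string. I assume both functions return something: either success/failure or, in the case of open(), a reference to the file object. I am not sure whether they can be combined successfully.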

From what I have learned so far, this would be done in two steps, i.e.

file = open(filename, 'r')   # create the file object
myString = file.read()       # read the whole file into a string

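The open() function creates the file object and returns it. It raises an exception on failure rather than returning a success/failure value.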

You then use the read(), read(n), readline(), or readlines() function on that object.

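.read() reads the whole file into a single string
.read(n) reads the next n characters into a string (n bytes if the file is opened in binary mode)
.readline() reads the next line into a string
.readlines() reads the whole file into a list of strings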
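The four calls listed above can be tried on a small throwaway file. This is only a sketch: the three-line file and the variable names are invented for the demo:

```python
import os
import tempfile

# Create a small three-line file to read from.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".txt",
                                 encoding="utf-8") as tmp:
    tmp.write("one\ntwo\nthree\n")
    path = tmp.name

with open(path, "r", encoding="utf-8") as f:
    whole = f.read()             # entire file as one string

with open(path, "r", encoding="utf-8") as f:
    first_two = f.read(2)        # next 2 characters
    rest_of_line = f.readline()  # remainder of the current line
    remaining = f.readlines()    # list of the lines that are left

os.remove(path)
print(repr(whole), repr(first_two), repr(rest_of_line), remaining)
```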

You could split them up and see whether you get the same result??? Just an idea from a newbie :)
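Trying the split suggested here is cheap. The sketch below forces encoding="ascii" to imitate the asker's environment (an assumption; the real default was not shown) and uses a made-up temporary file. In both forms open() succeeds and the decoding error comes from read(), so splitting the call would not be expected to change the outcome:

```python
import os
import tempfile

# Write a file containing the non-ASCII byte 0x92.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"it\x92s")
    path = tmp.name

results = {}

# One step: open() and read() chained together.
try:
    open(path, "r", encoding="ascii").read()
except UnicodeDecodeError:
    results["chained"] = "UnicodeDecodeError"

# Two steps: open() succeeds because nothing has been decoded yet...
f = open(path, "r", encoding="ascii")
try:
    f.read()  # ...and the decoding error is raised here.
except UnicodeDecodeError:
    results["two_step"] = "UnicodeDecodeError"
finally:
    f.close()

os.remove(path)
print(results)
```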
