此代码的目标是查找书中使用的单词的频率。
我想读一本书的文字,但是以下一行不断抛弃我的代码:
珍贵的protégés。不,先生们;他总是会给他们一个干净的配对
特别是é字符
我查看过以下文档,但我不太明白:https://docs.python.org/3.4/howto/unicode.html
继承我的代码:
import string
# Create word dictionary from the comprehensive word list
word_dict = {}
def create_word_dict ():
# open words.txt and populate dictionary
word_file = open ("./words.txt", "r")
for line in word_file:
line = line.strip()
word_dict[line] = 1
# Removes punctuation marks from a string
def parseString (st):
st = st.encode("ascii", "replace")
new_line = ""
st = st.strip()
for ch in st:
ch = str(ch)
if (n for n in (1,2,3,4,5,6,7,8,9,0)) in ch or ' ' in ch or ch.isspace() or ch == u'\xe9':
print (ch)
new_line += ch
else:
new_line += ""
# now remove all instances of 's or ' at end of line
new_line = new_line.strip()
print (new_line)
if (new_line[-1] == "'"):
new_line = new_line[:-1]
new_line.replace("'s", "")
# Conversion from ASCII codes back to useable text
message = new_line
decodedMessage = ""
for item in message.split():
decodedMessage += chr(int(item))
print (decodedMessage)
return new_line
# Returns a dictionary of words and their frequencies
def getWordFreq (file):
# Open file for reading the book.txt
book = open (file, "r")
# create an empty set for all Capitalized words
cap_words = set()
# create a dictionary for words
book_dict = {}
total_words = 0
# remove all punctuation marks other than '[not s]
for line in book:
line = line.strip()
if (len(line) > 0):
line = parseString (line)
word_list = line.split()
# add words to the book dictionary
for word in word_list:
total_words += 1
if (word in book_dict):
book_dict[word] = book_dict[word] + 1
else:
book_dict[word] = 1
print (book_dict)
# close the file
book.close()
def main():
wordFreq1 = getWordFreq ("./Tale.txt")
print (wordFreq1)
main()
我收到的错误如下:
Traceback (most recent call last):
File "Books.py", line 80, in <module>
main()
File "Books.py", line 77, in main
wordFreq1 = getWordFreq ("./Tale.txt")
File "Books.py", line 60, in getWordFreq
line = parseString (line)
File "Books.py", line 36, in parseString
decodedMessage += chr(int(item))
OverflowError: Python int too large to convert to C long
答案 0 :(得分:3)
在python中打开文本文件时,默认情况下编码为ANSI,因此它不包含échartecter。尝试
init
答案 1 :(得分:0)
尝试:
def parseString(st):
st = st.encode("ascii", "replace")
# rest of code here
您遇到的新错误是因为您在int上调用isalpha(即数字)
试试这个:
for ch in st:
ch = str(ch)
if (n for n in (1,2,3,4,5,6,7,8,9,0) if n in ch) or ' ' in ch or ch.isspace() or ch == u'\xe9':
print (ch)
答案 2 :(得分:0)
我能想到的最好的方法是将每个字符作为ASCII值读入数组,然后取char值。例如,97是&#34; a&#34;的ASCII。如果你做char(97),它将输出&#34; a&#34;。查看一些在线ASCII表,它们也为特殊字符提供值。