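I have written code to create a frequency table, but it breaks at the line text_string = document_text.read().lower().
I even tried a try/except to catch the error, but that did not help.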
import re
import string
frequency = {}
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
text_string = document_text.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text_string)
for word in match_pattern:
    try:
        count = frequency.get(word,0)
        frequency[word] = count + 1
    except UnicodeDecodeError:
        pass
frequency_list = frequency.keys()
for words in frequency_list:
    print (words, frequency[words])
Answer 0 (score: 1)
You open the file twice, and the second time you do not specify the encoding:
file = open('EVG_text mining.txt', encoding="utf8")
document_text = open('EVG_text mining.txt', 'r')
You should open the file like this instead:
frequencies = {}
with open('EVG_text mining.txt', encoding="utf8", mode='r') as f:
    text = f.read().lower()
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
...
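The second time you open the file, you do not say which encoding to use, so Python falls back to the platform default encoding; that is probably the cause of the error. The with statement takes care of some housekeeping tied to file I/O, such as closing the file when the block ends. You can read more about it here: https://www.pythonforbeginners.com/files/with-statement-in-python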
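To make the point about the with statement concrete, here is a minimal sketch. It writes a small temporary file (a hypothetical stand-in for 'EVG_text mining.txt'), reads it with an explicit encoding, and then checks that the file was closed once the block ended.

```python
import os
import tempfile

# Hypothetical stand-in for the real input file, written as UTF-8.
with tempfile.NamedTemporaryFile("w", encoding="utf8", delete=False, suffix=".txt") as tmp:
    tmp.write("Text mining example")
    sample_path = tmp.name

# Open once, with an explicit encoding, inside a with block.
with open(sample_path, encoding="utf8", mode="r") as f:
    text = f.read().lower()

# The file object still exists here, but the with block has already closed it.
closed_after_block = f.closed
print(text, closed_after_block)

os.remove(sample_path)  # clean up the temporary file
```

Nothing here depends on the temporary file itself; the same pattern applies to any path you pass to open().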
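You should probably also read up on error handling, because your try block does not enclose the line that actually raises the error: https://www.pythonforbeginners.com/error-handling/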
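As a rough sketch of where the try block would need to go, assuming the failure really is a UnicodeDecodeError raised while reading: the helper name read_text and the sample bytes below are made up for illustration and are not part of the original code.

```python
import os
import tempfile


def read_text(path, encoding="utf8"):
    """Return the file's text, or None if it cannot be decoded (hypothetical helper)."""
    try:
        # The decode happens inside f.read(), so this is the line to guard.
        with open(path, encoding=encoding, mode="r") as f:
            return f.read()
    except UnicodeDecodeError as exc:
        print(f"Could not decode {path!r}: {exc}")
        return None


# Hypothetical sample file: the byte 0xE9 is not valid UTF-8 in this position.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"caf\xe9 text mining")
    sample_path = tmp.name

result = read_text(sample_path)                      # decoding fails, returns None
fallback = read_text(sample_path, encoding="latin1")  # decodes every byte
os.remove(sample_path)
```

Wrapping the counting loop instead, as in the question, catches nothing, because by then the text has already been decoded.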
Code that ignores all decoding problems:
import re
import string  # Do you need this?
with open('EVG_text mining.txt', mode='rb') as f:  # The 'b' in mode makes open() return raw bytes.
    raw = f.read()
# 'ignore' drops undecodable bytes; change it to 'replace' to insert U+FFFD (the replacement character) instead.
text = raw.decode('utf-8', 'ignore').lower()  # lower() matches the question's code, so capitalized words are counted too.
match_pattern = re.findall(r'\b[a-z]{3,15}\b', text)
frequencies = {}
for word in match_pattern:  # Your error handling wasn't doing anything here, as the error occurred when reading the file, not here.
    frequencies[word] = frequencies.get(word, 0) + 1
for word, freq in frequencies.items():
    print(word, freq)
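If it helps to see what the two error handlers actually do, here is a small sketch on in-memory bytes (the sample bytes are invented). It also swaps the manual dictionary for collections.Counter, which is a substitution on my part and not what the code above uses.

```python
import re
from collections import Counter

# Invented sample: 0xE9 is not valid UTF-8 in this position.
raw = b"Text mining: caf\xe9 text, MINING text"

ignored = raw.decode("utf-8", "ignore")    # the bad byte is silently dropped
replaced = raw.decode("utf-8", "replace")  # the bad byte becomes U+FFFD

# Same pattern as above, applied to the lowercased text.
words = re.findall(r"\b[a-z]{3,15}\b", ignored.lower())
frequencies = Counter(words)  # Counter does the get-and-increment for us

for word, freq in frequencies.most_common():
    print(word, freq)
```

Note that 'ignore' turns the damaged word into "caf", so the counts can contain truncated words; whether that is acceptable depends on your data.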
Answer 1 (score: -1)
To read a file that contains special characters, set the encoding to "latin1" or "unicode_escape".
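One caveat, shown as a small sketch on invented sample text: "latin1" maps every possible byte to a character, so it never raises a decoding error, but if the file was really written as UTF-8 the accented characters come out garbled. Whether "latin1" is the right choice depends on how the file was actually saved.

```python
# Invented sample text, encoded as UTF-8.
raw = "café naïve".encode("utf-8")

as_utf8 = raw.decode("utf-8")     # correct: the bytes really are UTF-8
as_latin1 = raw.decode("latin1")  # no error, but each accented letter becomes two characters

# latin1 can decode any byte value, which is why it never fails.
all_bytes = bytes(range(256)).decode("latin1")

print(as_utf8)
print(as_latin1)
print(len(all_bytes))
```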