I want to write a simple Python program that finds the most frequently used words in a file. My file's contents look like this:
<text>աաա բբբ գգգ աաաա բբբ</text>
.....
<text>բբբ աաագգգ աաաա բբբ</text>
.....
<text>աաաաաաա բբբ հհհհ բբբ գգգ </text>
Here is my Python code:
# -*- coding: utf-8 -*-
import re
import collections
a = open('dump.txt', encoding='UTF-8', errors='replace')
contents = a.read()
articlelist = re.findall(r'<text[^>]+>([^<]+)</text>', contents, re.M)
wordsandnumber = []
for article in articlelist:
    wordsinarticle = re.findall(r'\w+', article)
    for finaly in wordsinarticle:
        wordsandnumber.append(finaly)
counter = collections.Counter(wordsandnumber)
mylist = counter.most_common()
open('as.txt', 'w').write('\n'.join('%s %s' % x for x in mylist))
print(counter.most_common())
But the code fails with this error:
Traceback (most recent call last):
  File "C:\Users\Home\Downloads\test.py", line 14, in <module>
    open('as.txt', 'w').write('\n'.join('%s %s' % x for x in mylist))
  File "C:\Users\Home\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0567' in position 0: character maps to <undefined>
I'm a beginner at programming. Please help me fix this problem and understand why this code doesn't work.
In case it matters: I'm using Windows 10 and Python 3.5.
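
For reference, the traceback points at the write to as.txt, not at the read: the input file is opened with encoding='UTF-8', but the output file is opened without an encoding, so Python appears to fall back to the Windows code page (cp1252 in the traceback), which has no Armenian letters. Below is a minimal sketch of the same counting with an explicit output encoding. The helper names count_words and write_counts are hypothetical, and the relaxed pattern `[^>]*` is an assumption so that a bare <text> tag like the ones in the sample also matches.

```python
import collections
import re


def count_words(text):
    """Count the words found inside <text>...</text> elements of `text`."""
    # Assumption: [^>]* instead of [^>]+ so a <text> tag without
    # attributes (as in the sample above) is matched too.
    articles = re.findall(r'<text[^>]*>([^<]+)</text>', text)
    counter = collections.Counter()
    for article in articles:
        # In Python 3, \w matches Unicode letters, including Armenian.
        counter.update(re.findall(r'\w+', article))
    return counter


def write_counts(counter, path):
    """Write 'word count' lines to `path`, most common word first."""
    # Explicit encoding: without it, open() uses the locale's default
    # encoding, which is what raised UnicodeEncodeError in the traceback.
    with open(path, 'w', encoding='utf-8') as out:
        out.write('\n'.join('%s %s' % pair for pair in counter.most_common()))
```

With these, something like write_counts(count_words(contents), 'as.txt') would replace the loop and the failing open(...).write(...) line, where contents is the string already read from dump.txt.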