The following code prints the data:
import codecs

# No encoding is given here, so the file is read as raw bytes
f = codecs.open('scrapeddata.csv', 'r')
data = f.read()
print data
The data looks like this:
Foul by Fabian Sch�r (Switzerland). Wayne Rooney (England) wins a free kick in the attacking half. Attempt missed. Xherdan Shaqiri (Switzerland) right footed shot from outside the box is high and wide to the right. Assisted by Josip Drmic. Booking James Milner (England) is shown the yellow card for a bad foul. Stephan Lichtsteiner (Switzerland) wins a free kick in the defensive half. Foul by James Milner (England). Offside, Switzerland. G�khan Inler tries a through ball, but Xherdan Shaqiri is caught offside.
I then tried a simple word-frequency analysis with the following code:
from nltk import FreqDist, sent_tokenize, word_tokenize
data = word_tokenize(data)
freq = FreqDist(data)
freq
It returns:
----> 3 data = word_tokenize(data)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x94 in position 14: ordinal not in range(128)
Any help?
Answer 0 (score: 1)
Provide an explicit encoding when you open the file. You said it is UTF-8, so tell Python:
f = codecs.open('scrapeddata.csv', 'r', 'utf-8')
data = f.read()
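With the encoding given, f.read() returns decoded text rather than raw bytes, so the tokenizer no longer has to fall back on an implicit ASCII decode. Below is a minimal sketch of the whole round trip, under some assumptions: the sample sentence and temporary file are made up for illustration, the helper names read_text and word_frequencies are hypothetical, and a plain regex plus collections.Counter stand in for nltk's word_tokenize and FreqDist so the snippet runs without NLTK installed.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile
from collections import Counter


def read_text(path, encoding="utf-8"):
    """Open the file with an explicit encoding and return decoded text."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


def word_frequencies(text):
    """Count word tokens; a simple stand-in for word_tokenize + FreqDist."""
    tokens = re.findall(r"\w+", text, flags=re.UNICODE)
    return Counter(tokens)


# Hypothetical sample data, written as UTF-8 to a temporary file
sample = u"Foul by Fabian Sch\u00e4r (Switzerland). Foul by James Milner (England)."
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(sample)

data = read_text(path)          # decoded text, not raw bytes
freq = word_frequencies(data)   # non-ASCII names are counted like any other word
os.remove(path)

print(freq.most_common(3))
```

If decoding with 'utf-8' itself raises UnicodeDecodeError, the file is probably stored in a different encoding, and that encoding's name should be passed instead.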