Question

I want to compute TF-IDF for a set of documents (10 of them). I am using Python from Anaconda.

import nltk
import string
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

path = '/opt/datacourse/data/parts'
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        shakes = open(file_path, 'r')
        text = shakes.read()
        lowers = text.lower()
        no_punctuation = lowers.translate(None, string.punctuation)
        token_dict[file] = no_punctuation

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

But when I run tfs = tfidf.fit_transform(token_dict.values()), I get the following error message.

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 1412: invalid start byte

How can I fix this error?

Answer 1

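I followed the same reference for data preprocessing and got exactly the same error. These are the steps that gave me working code with Python 2.7 on an Ubuntu 14.04 machine:

1) Open the file using "codecs" and set the "encoding" parameter to ISO-8859-1. Here is how you do it: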

import codecs
with codecs.open(pathToYourFileWithFileName, "r", encoding="ISO-8859-1") as file_handle:
    text = file_handle.read()  # text is now a unicode string
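
As a rough sketch of step 1, the snippet below reads a file with an explicit encoding. It uses io.open, which accepts an encoding argument on both Python 2.7 and Python 3; the helper name read_text, the sample bytes, and the temporary file are assumptions made up for illustration and are not part of the original answer.

```python
import io
import os
import tempfile


def read_text(file_path, encoding="ISO-8859-1"):
    """Read a whole file as text, decoding its bytes with the given encoding."""
    with io.open(file_path, "r", encoding=encoding) as file_handle:
        return file_handle.read()


# Hypothetical sample data: byte 0x92 on its own is not valid UTF-8.
sample_bytes = b"the king\x92s crown"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as raw_file:
    raw_file.write(sample_bytes)

try:
    # ISO-8859-1 maps every byte to exactly one character, so this cannot fail.
    text = read_text(sample_path)
finally:
    os.remove(sample_path)

print(len(text) == len(sample_bytes))
```

Because ISO-8859-1 assigns a character to every possible byte value, decoding never raises, although the resulting characters are only meaningful if the file really uses that encoding.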

2) Once you have done this first step, you will run into a second problem when using

no_punctuation = lowers.translate(None, string.punctuation)

This is explained in the question "string.translate() with unicode data in python".

The solution is something like:

lowers = text.lower()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
no_punctuation = lowers.translate(remove_punctuation_map)
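
For readers on Python 3, where every str is already unicode, the same idea can be sketched with str.maketrans, which builds an equivalent deletion table. The function name strip_punctuation and the sample sentence are assumptions for illustration only.

```python
import string

# Python 3 only: the three-argument form of str.maketrans maps every
# character in the third argument to None, i.e. "delete this character".
remove_punctuation_map = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Lowercase the text and delete ASCII punctuation characters."""
    return text.lower().translate(remove_punctuation_map)


print(strip_punctuation("Hello, World! It's TF-IDF."))
```

Note that string.punctuation only covers ASCII punctuation, so characters such as curly quotes pass through unchanged.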

I hope it helps.

Answer 2

Your data is encoded in some other encoding :)

To decode the data in your string, use the following:

myvar.decode("ENCODING")

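ENCODING can be any encoding name. In your code this decoding happens implicitly in the background, using "utf-8".

You should try "latin1" or "latin2"; alongside utf-8, these two are the most commonly used.

Cheers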
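
A small sketch of what this looks like in practice, written for Python 3 and using made-up sample bytes: decoding as "utf-8" fails on byte 0x92, while "latin1" succeeds. The "cp1252" line is an extra assumption not mentioned in the answer; it is included because Windows-1252 maps 0x92 to a curly apostrophe, which is a common source of this byte in English text.

```python
raw = b"the king\x92s crown"  # hypothetical bytes containing 0x92

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0x92 is a UTF-8 continuation byte, so it cannot start a character.
    utf8_ok = False

# latin1 never fails: every byte maps to exactly one code point.
as_latin1 = raw.decode("latin1")
# cp1252 (an assumption, not from the answer) maps 0x92 to U+2019,
# the right single quotation mark.
as_cp1252 = raw.decode("cp1252")

print(utf8_ok)
print(repr(as_latin1))
print(repr(as_cp1252))
```

If the decoded text shows unexpected control characters where apostrophes should be, that is a hint that a different encoding than the one tried is the right one.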
