I am extracting messages into a pandas DataFrame and trying to run some machine learning functions on the data. When I run the tokenization function I get an error, KeyError: "...", which basically spits out the contents of one of the messages. Looking at that string, there are escaped UTF-8 byte sequences in it, e.g. \xe2\x80\xa8 (whitespace) and \xe2\x82\xac (the euro currency symbol).
1. Is this the cause of the error?
2. Why are these symbols not preserved the way they appear in the original message or in the DataFrame?
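For context, those escapes look like the raw UTF-8 byte sequences of real characters; a quick Python 2 check (separate from the pipeline below) shows what they decode to:

print(repr('\xe2\x80\xa8'.decode('utf-8')))  # u'\u2028' (LINE SEPARATOR)
print(repr('\xe2\x82\xac'.decode('utf-8')))  # u'\u20ac' (euro sign)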
# coding=utf-8
from __future__ import print_function
import sys
reload(sys)
sys.setdefaultencoding("utf8")
import os
import re  # used by tokenize_only() below
import pandas as pd

path = '//directory1//'
data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path+f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))
df = pd.DataFrame(data, columns=["message"])
df["label"] = "1"
path = '//directory2//'
data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path+f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))
df2 = pd.DataFrame(data, columns=["message"])
df2["label"] = "0"

messages = pd.concat([df, df2], ignore_index=True)
import nltk
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')

def tokenize_only(text):
    # first tokenize by sentence, then by word, so punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_only,
                                   ngram_range=(1, 2))  # analyzer='word'
tfidf_matrix = tfidf_vectorizer.fit_transform(messages.message)  # fit the vectorizer to the corpora
terms = tfidf_vectorizer.get_feature_names()

totalvocab_tokenized = []
for i in messages.message:
    # x = messages.message[i].decode('utf-8')
    x = unicode(messages.message[i], errors="replace")
    allwords_tokenized = tokenize_only(x)
    totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized})
print(vocab_frame)
I have tried decoding each message as utf-8 and as unicode, and also running that last for loop without those two lines, but I keep getting the error.
Any ideas?
Thanks!
Answer 0 (score: 1)
It looks like you are printing the repr() of the data. Python may choose to escape characters it cannot print as UTF-8. Print the actual str or unicode value instead.
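A minimal sketch of the difference, with made-up tokens (Python 2):

tokens = [u'caf\xe9', u'\u20ac100']
print(tokens)     # printing the list uses repr() of each element: [u'caf\xe9', u'\u20ac100']
print(tokens[0])  # printing the unicode value itself renders the character: café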
Get rid of sys.setdefaultencoding("utf8") and the sys reload; they mask problems rather than fix them. If you then get a new exception, let's investigate that one.
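Where you do need a decoded value, decode explicitly at that boundary instead of relying on a process-wide default; a sketch with assumed input bytes:

raw = '\xe2\x82\xac100'     # raw UTF-8 bytes, e.g. as read from a file
text = raw.decode('utf-8')  # u'\u20ac100', explicit and predictable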
Open the text files with automatic decoding (this requires import io). Assuming your input is UTF-8:
with io.open(path+f, "r", encoding="utf-8") as myfile:
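With io.open, myfile.read() already returns unicode, so the unicode(..., errors="replace") conversion in your last loop becomes unnecessary. A minimal sketch of the adjusted reading loop, reusing your variable names:

import io
import os

data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with io.open(path + f, "r", encoding="utf-8") as myfile:
        # read() returns a unicode object; no manual decoding needed
        data.append(myfile.read().replace(u'\n', u' '))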