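I am using TfidfVectorizer from scikit-learn to extract features from text data. I have a CSV file with a score (either +1 or -1) and a review (text). I load this data into a DataFrame so that I can run the vectorizer on it.
Here is my code: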
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
df = pd.read_csv("train_new.csv",
                 names=['Score', 'Review'], sep=',')
# x = df['Review'] == np.nan
#
# print x.to_csv(path='FindNaN.csv', sep=',', na_rep = 'string', index=True)
#
# print df.isnull().values.any()
v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['Review'])
Here is the traceback of the error I get:
Traceback (most recent call last):
File "/home/PycharmProjects/Review/src/feature_extraction.py", line 16, in <module>
x = v.fit_transform(df['Review'])
File "/home/b/hw1/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 1305, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 817, in fit_transform
self.fixed_vocabulary_)
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 752, in _count_vocab
for feature in analyze(doc):
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 238, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/home/b/work/local/lib/python2.7/site-packages/sklearn/feature_extraction/text.py", line 118, in decode
raise ValueError("np.nan is an invalid document, expected byte or "
ValueError: np.nan is an invalid document, expected byte or unicode string.
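I checked the CSV file and the DataFrame for anything that is being read as NaN, but I couldn't find anything. There are 18000 rows, and none of them returns True for isnan.
This is what df['Review'].head() looks like: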
0 This book is such a life saver. It has been s...
1 I bought this a few times for my older son and...
2 This is great for basics, but I wish the space...
3 This book is perfect! I'm a first time new mo...
4 During your postpartum stay at the hospital th...
Name: Review, dtype: object
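A note on the NaN check itself: the commented-out line compares the column with `== np.nan`, and that comparison never matches, because NaN is not equal to anything, including itself. The sketch below assumes Python 3 and uses a small made-up CSV (not the real train_new.csv) to show how an empty review field ends up as NaN and how isnull() locates it.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for train_new.csv; row 1 has an empty review.
csv_text = "1,Great book\n-1,\n1,Would buy again\n"
df = pd.read_csv(io.StringIO(csv_text), names=["Score", "Review"], sep=",")

# Equality with np.nan is False for every row, even the missing one.
eq_hits = int((df["Review"] == np.nan).sum())

# isnull() is the check that actually detects missing values.
null_hits = int(df["Review"].isnull().sum())
bad_rows = df.index[df["Review"].isnull()].tolist()

print("matches via == np.nan:", eq_hits)
print("matches via isnull(): ", null_hits)
print("row labels with a missing review:", bad_rows)
```

Running the same isnull() check on the real DataFrame should point at the rows that trigger the ValueError.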
Answer 0 (score: 75)
You need to convert the dtype object to unicode strings, as the traceback clearly says.
x = v.fit_transform(df['Review'].values.astype('U')) ## Even astype(str) would work
From the doc page of the TFIDF Vectorizer:
fit_transform(raw_documents, y=None)
Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
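It may help to see what the conversion does to a missing value. The sketch below assumes Python 3 and a made-up Series with one NaN; it shows that astype('U') turns NaN into the literal string 'nan', so the token "nan" would enter the vocabulary. Filling missing reviews with an empty string via fillna('') is an alternative I am suggesting here, not something this answer proposes.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews; the middle one is missing.
reviews = pd.Series(["Great book", np.nan, "Would buy again"], dtype=object)

# The conversion from this answer: every element becomes a unicode string,
# so the float NaN is stringified to 'nan'.
as_unicode = reviews.values.astype("U")

# Alternative (an assumption, not from the answer): treat a missing review
# as an empty document instead.
filled = reviews.fillna("")

print(as_unicode.tolist())
print(filled.tolist())
```

Either result contains only strings, which is what fit_transform expects from its input.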
Answer 1 (score: 3)
I found a more efficient way to solve this problem.
x = v.fit_transform(df['Review'].apply(lambda x: np.str_(x)))
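Of course, you can use df['Review'].values.astype('U') to convert the whole Series. But I found that this consumes much more memory when the Series is large. (I tested it on a Series with 800,000 rows, and astype('U') consumed about 96GB of memory.)
Instead, if you use a lambda expression to convert only the data in the Series from str to numpy.str_, the result is also accepted by fit_transform, and this is faster and does not increase memory usage.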
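I'm not sure why this works, because the doc page of the TFIDF Vectorizer says:
fit_transform(raw_documents, y=None)
Parameters: raw_documents : iterable
an iterable which yields either str, unicode or file objects
But in my tests this iterable had to yield np.str_ rather than str.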
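A possible explanation, offered as a sketch rather than a confirmed diagnosis: under Python 3, np.str_ is a subclass of str, so the vectorizer should not need it specifically. What the lambda really changes is that any NaN in the column gets stringified to 'nan'. The example below uses a made-up two-row Series and shows that a plain str conversion produces the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews; the second one is missing.
reviews = pd.Series(["Great book", np.nan], dtype=object)

# The conversion from this answer.
as_np_str = reviews.apply(lambda x: np.str_(x))

# The same conversion with the built-in str type.
as_py_str = reviews.apply(str)

# Under Python 3, np.str_ is a subclass of str.
np_str_is_str = isinstance(np.str_("a"), str)

print(np_str_is_str)
print([str(v) for v in as_np_str])
print(as_py_str.tolist())
```

If both lists look the same on your data, the NaN handling and not the np.str_ type is what makes fit_transform accept the input.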
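Answer 2 (score: 1)
I was still getting a MemoryError even after using .values.astype('U') on the reviews in my dataset.
So I tried .astype('U').values instead, and it worked.
This comes from the answer to "Python: how to avoid MemoryError when transform text data into Unicode using astype('U')".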
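A likely reason for the difference, sketched under the assumption of Python 3 and made-up data: calling astype('U') on the NumPy array builds a fixed-width unicode array, where every element reserves 4 bytes per character of the longest string. Calling astype('U') on the pandas Series does not produce such a fixed-width array. The exact dtype pandas returns depends on the pandas version, so the code only checks that it is not a NumPy 'U' dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews: one short, one long.
reviews = pd.Series(["short", "x" * 1000], dtype=object)

# NumPy conversion: fixed-width array sized for the longest review.
via_numpy = reviews.values.astype("U")

# pandas conversion first, then .values: strings keep their own lengths.
via_pandas = reviews.astype("U").values

print("numpy dtype:", via_numpy.dtype)
print("bytes per element:", via_numpy.itemsize)
print("total bytes:", via_numpy.nbytes)
print("pandas dtype:", via_pandas.dtype)
```

With one very long review among many short ones, the fixed-width array pads every row to that length, which would explain a MemoryError on a large dataset.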