Python TfidfVectorizer raises TypeError: expected string or bytes-like object on a CSV file

Time: 2017-05-12 20:48:58

Tags: python csv scikit-learn tf-idf sklearn-pandas

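I am analyzing a very large CSV file and trying to extract tf-idf information from it with scikit-learn. Unfortunately, the processing never finishes because it throws this TypeError. Is there a way to change the CSV file programmatically to get rid of the error? Here is my code: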

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_csv("C:/Users/aidan/Downloads/papers/papers.csv", sep=None)
    df = df[pd.notnull(df)]

    n_features = 1000
    n_topics = 8
    n_top_words = 10
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english', lowercase=False)

    tfidf = tfidf_vectorizer.fit_transform(df['paper_text'])

The error is raised from the last line. Thanks in advance!

    Traceback (most recent call last):
      File "C:\Users\aidan\NIPS Analysis 2.0.py", line 35, in <module>
        tfidf = tfidf_vectorizer.fit_transform(df['paper_text'])
      File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 1352, in fit_transform
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
      File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
        self.fixed_vocabulary_)
      File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 762, in _count_vocab
        for feature in analyze(doc):
      File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 241, in <lambda>
        tokenize(preprocess(self.decode(doc))), stop_words)
      File "c:\python\python36\lib\site-packages\sklearn\feature_extraction\text.py", line 216, in <lambda>
        return lambda doc: token_pattern.findall(doc)
    TypeError: expected string or bytes-like object

3 Answers:

Answer 0: (score: 1)

Have you checked df.dtypes? What is the output?

You could try adding dtype=str as an argument to the .read_csv() call.
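Because an object column can hold strings next to other values, df.dtypes alone may not reveal the problem. The following is a minimal sketch of a row-level check, assuming a small hypothetical DataFrame in place of the real CSV; only the column name paper_text comes from the question.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: one missing text and one numeric cell.
df = pd.DataFrame({"paper_text": ["first paper", None, 42, "last paper"]})

# The column is reported as object even though it mixes several value types.
print(df.dtypes)

# Flag every row whose value is not a Python str.
not_str = ~df["paper_text"].map(lambda value: isinstance(value, str))
bad_rows = df[not_str]
print(bad_rows)
```

Any rows listed in bad_rows are candidates for what the vectorizer cannot tokenize.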

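Answer 1: (score: 0)

In my case, the problem was that I had NaN values in the data frame. Replacing the NaN values fixed it for me.

    df = df.fillna('0')  # fillna returns a new DataFrame, so assign the result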
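Why NaN leads to this particular error can be sketched without pandas or scikit-learn. The last frame of the traceback calls token_pattern.findall(doc), and NaN is a float, which a regular expression cannot search. The pattern below is assumed to match scikit-learn's default token_pattern, and the two documents are made up for illustration.

```python
import re

# Assumed to be the same regular expression as scikit-learn's default token_pattern.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# float("nan") stands in for an empty CSV cell as pandas would load it.
documents = ["neural networks learn", float("nan")]

failed = []
for position, doc in enumerate(documents):
    try:
        token_pattern.findall(doc)
    except TypeError:
        # findall only accepts str or bytes-like input, so the float fails here.
        failed.append(position)

print(failed)
```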

Answer 2: (score: 0)

Read the file this way:

    df = pd.read_csv("C:/Users/aidan/Downloads/papers/papers.csv", dtype=str)

In fact, your elements should be of type string.
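One caveat: as far as I know, dtype=str does not turn empty cells into strings, so missing values may still be loaded as NaN. Below is a minimal end-to-end sketch that combines dtype=str with dropping the missing rows. The in-memory CSV and its id column are hypothetical; only paper_text comes from the question.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-row CSV; the second row has an empty paper_text cell.
csv_data = io.StringIO(
    "id,paper_text\n"
    "1,neural networks learn features\n"
    "2,\n"
    "3,kernel methods learn features\n"
)

df = pd.read_csv(csv_data, dtype=str)

# Count the cells that are still missing after reading with dtype=str.
missing_before = int(df["paper_text"].isna().sum())

# Drop rows without text, then make sure every remaining value is a str.
texts = df["paper_text"].dropna().astype(str)

tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vectorizer.fit_transform(texts)
print(missing_before, tfidf.shape)
```

Whether to drop the empty rows or fill them with an empty string depends on whether the row positions need to stay aligned with the original file.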