How do I apply CountVectorizer to each row of a dataframe?

Time: 2019-10-04 13:30:42

Tags: python pandas dataframe scikit-learn countvectorizer

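I have a dataframe, say df, with 3 columns. Columns A and B contain strings. Column C is a numeric variable.

I want to convert it into a feature matrix by passing it to CountVectorizer.

I define my CountVectorizer as: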

cv = CountVectorizer(input='content', encoding='iso-8859-1', 
                     decode_error='ignore', analyzer='word',
                    ngram_range=(1), tokenizer=my_tokenizer, stop_words='english',
                    binary=True)

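Next, I passed the whole dataframe to cv.fit_transform(df), which did not work. I got this error: cannot unpack non-iterable int object

Next, I converted each row of the dataframe into a single string: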

sample = pdt_items["A"] + "," + pdt_items["C"].astype(str) + "," + pdt_items["B"]

Then I applied:

cv_m = sample.apply(lambda row: cv.fit_transform(row))

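I still get an error: ValueError: Iterable over raw text documents expected, string object received.

Please let me know where I am going wrong, or whether I need to take a different approach.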

2 Answers:

Answer 0: (score: 1)

Try this:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

A = ['very good day', 'a random thought', 'maybe like this']
B = ['so fast and slow', 'the meaning of this', 'here you go']
C = [1, 2, 3]

pdt_items = pd.DataFrame({'A':A,'B':B,'C':C})

cv = CountVectorizer()

# use pd.DataFrame here to avoid your error and add your column name    
sample = pd.DataFrame(pdt_items['A']+','+pdt_items['B']+','+pdt_items['C'].astype('str'), columns=['Output'])

vectorized = cv.fit_transform(sample['Output'])
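As a hedged note on why this works: both errors in the question appear to come from how the input is passed rather than from the data itself. ngram_range=(1) is just the integer 1, while CountVectorizer expects a (min_n, max_n) pair, which would explain the unpacking error. Iterating over a DataFrame also yields its column labels, not its rows. In the second attempt, Series.apply hands fit_transform one string at a time, whereas it expects an iterable of strings. A minimal sketch, reusing the sample data above and assuming the default tokenizer in place of the question's my_tokenizer:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

pdt_items = pd.DataFrame({
    'A': ['very good day', 'a random thought', 'maybe like this'],
    'B': ['so fast and slow', 'the meaning of this', 'here you go'],
    'C': [1, 2, 3],
})

# (1) is the int 1, not a tuple; ngram_range needs a (min_n, max_n) pair
wrong_ngram_range = (1)
right_ngram_range = (1, 1)

# Iterating over a DataFrame yields column labels, not rows
column_labels = list(pdt_items)

# Build one string per row, then fit once on the whole Series
sample = pdt_items['A'] + ',' + pdt_items['B'] + ',' + pdt_items['C'].astype(str)
cv = CountVectorizer(ngram_range=right_ngram_range, stop_words='english', binary=True)
matrix = cv.fit_transform(sample)  # sparse matrix, one row per dataframe row

# Passing a single bare string reproduces the ValueError from the question
try:
    cv.fit_transform(sample.iloc[0])
    bare_string_error = None
except ValueError as exc:
    bare_string_error = str(exc)
```

Fitting once on the whole column gives every row the same vocabulary, so the rows of the resulting matrix are comparable as feature vectors.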

Answer 1: (score: 0)

With the help of @QuantStats's comment, I applied cv to each row of the dataframe as follows:

row_input = df['column_name'].tolist()

kwds = []
for i in range(len(row_input)):
  cell_input = [row_input[i]]
  full_set = row_keywords(cell_input, 1,1)
  candidates = [x for x in full_set if x[1]> 1] # to extract frequencies more than 1
  kwds.append(candidates)

kwds_col = pd.Series(kwds)
df['Keywords'] = kwds_col

("row_keywords" is a function that wraps CountVectorizer.)
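The answer does not show row_keywords itself. The following is only a guess at what such a helper might look like, assuming its two numeric arguments are the lower and upper n-gram bounds and that it returns (term, count) pairs; the body is hypothetical and not the answerer's actual code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def row_keywords(docs, min_n, max_n):
    """Hypothetical helper: return (term, count) pairs for a list of documents."""
    cv = CountVectorizer(ngram_range=(min_n, max_n))
    counts = cv.fit_transform(docs)                  # shape: (len(docs), n_terms)
    totals = np.asarray(counts.sum(axis=0)).ravel()  # total count per term
    # vocabulary_ maps term -> column index; sort terms into column order
    terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)
    return [(term, int(total)) for term, total in zip(terms, totals)]

# Usage mirroring the loop above: one cell wrapped in a list, unigrams only
full_set = row_keywords(['the cat and the dog and the bird'], 1, 1)
candidates = [x for x in full_set if x[1] > 1]  # keep terms that occur more than once
```

Because this sketch fits a fresh vectorizer for each cell, the vocabulary differs from row to row. That suits per-row keyword extraction but does not produce a shared feature matrix. A cell that yields no tokens would make fit_transform raise a ValueError for an empty vocabulary, so real data may need a guard.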