I have a pandas DataFrame, df, that looks like this:
column1
0 apple is a fruit
1 fruit sucks
2 apple tasty fruit
3 fruits what else
4 yup apple map
5 fire in the hole
6 that is true
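I want to generate a column2 that holds, for each row, a list of the words in that row, each paired with that word's total count across the whole column. So the output would look like this: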
column1 column2
0 apple is a fruit [('apple', 3),('is', 2),('a', 1),('fruit', 3)]
1 fruit sucks [('fruit', 3),('sucks', 1)]
I tried using sklearn but could not achieve the above. Any help would be appreciated.
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
x = v.fit_transform(df['column1'])
Answer 0 (score: 0)
Here is one way to get the desired result, although it avoids sklearn entirely:
import re

def counts(data, column):
    full_list = []
    datr = data[column].tolist()
    # every word in the column, used for the column-wide totals
    total_words = " ".join(datr).split(' ')
    # per row
    for i in range(len(datr)):
        # first get the words in this row
        word_list = re.sub(r"[^\w]", " ", datr[i]).split()
        # cycle per word
        total_row = []
        for word in word_list:
            count = total_words.count(word)
            val = (word, count)
            total_row.append(val)
        full_list.append(total_row)
    return full_list

df['column2'] = counts(df, 'column1')
df
column1 column2
0 apple is a fruit [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
1 fruit sucks [(fruit, 3), (sucks, 1)]
2 apple tasty fruit [(apple, 3), (tasty, 1), (fruit, 3)]
3 fruits what else [(fruits, 1), (what, 1), (else, 1)]
4 yup apple map [(yup, 1), (apple, 3), (map, 1)]
5 fire in the hole [(fire, 1), (in, 1), (the, 1), (hole, 1)]
6 that is true [(that, 1), (is, 2), (true, 1)]
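The function above rescans the full word list once for every word in every row. As a possible alternative, here is a minimal sketch that counts the whole column once with collections.Counter and then looks each word up. The name word_counts and the plain list of strings are assumptions made for illustration; they are not part of the original answer.

```python
from collections import Counter
import re

def word_counts(rows):
    """Return, for each row, a list of (word, column-wide count) tuples."""
    # Tokenize each row once; \w+ keeps runs of letters, digits and underscores.
    tokenized = [re.findall(r"\w+", row) for row in rows]
    # Count every word across the whole column in a single pass.
    totals = Counter(word for words in tokenized for word in words)
    # Pair each word with its total, preserving the word order within each row.
    return [[(word, totals[word]) for word in words] for words in tokenized]

# Sample data copied from the question.
rows = [
    "apple is a fruit",
    "fruit sucks",
    "apple tasty fruit",
    "fruits what else",
    "yup apple map",
    "fire in the hole",
    "that is true",
]
result = word_counts(rows)
print(result[0])
print(result[1])
```

With the DataFrame from the question, this could be used as df['column2'] = word_counts(df['column1']), since iterating over a Series yields its values.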
Answer 1 (score: -1)
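I don't know whether you can do this with scikit-learn, but you can write a function and then use apply() to apply it to your DataFrame or Series.
Here is how you could do it for your example:
import pandas as pd

test = pd.DataFrame(['apple is a fruit', 'fruit sucks', 'apple tasty fruit'], columns=['A'])

def a_function(row):
    # with axis=1 each row arrives as a Series; take its single cell as a string
    splitted_row = str(row.values[0]).split()
    word_occurences = []
    for word in splitted_row:
        # str.count treats word as a pattern and counts every match in the column
        column_occurences = test.A.str.count(word).sum()
        word_occurences.append((word, column_occurences))
    return word_occurences

test.apply(a_function, axis=1)
# Output
0    [(apple, 2), (is, 1), (a, 4), (fruit, 3)]
1    [(fruit, 3), (sucks, 1)]
2    [(apple, 2), (tasty, 1), (fruit, 3)]
dtype: object
As you can see, the main problem is that test.A.str.count(word) counts every occurrence of word, as long as the pattern assigned to word appears anywhere in the string, including inside longer words. That is why "a" shows up 4 times. This could probably be fixed easily with a regular expression (which I'm not very good at).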
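One possible form of the regex fix mentioned above is to wrap the word in \b word-boundary anchors so that only whole words match. This is a sketch rather than the answerer's code: count_whole_word is a hypothetical helper, and it works on a plain list of strings to stay self-contained.

```python
import re

def count_whole_word(word, rows):
    """Count occurrences of word as a whole word across all rows."""
    # \b matches at a word boundary; re.escape keeps any regex
    # metacharacters in the word from being interpreted as a pattern.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return sum(len(pattern.findall(row)) for row in rows)

# Same three rows as the test DataFrame above.
rows = ["apple is a fruit", "fruit sucks", "apple tasty fruit"]

# Plain substring counting reproduces the inflated count for "a".
substring_count = sum(row.count("a") for row in rows)
whole_word_count = count_whole_word("a", rows)
print(substring_count, whole_word_count)
```

Because Series.str.count accepts a regular expression, the same pattern could presumably be passed straight to test.A.str.count inside a_function. Note that \b only behaves as intended for words that start and end with a word character.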
Alternatively, if you are willing to lose some words, you can use this workaround inside the function above: