This is my current code. The CSV file I am using has two columns: one contains the text and the other the conversation it belongs to. I have managed to extract the different n-grams from the text, but I would also like to link the conversation ids to each n-gram. So if an n-gram occurs x times, I want to see which conversations it appears in. How can I do this?
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
df = pd.read_csv("F:/textclustering/data/filteredtext1.csv", encoding="iso-8859-1" ,low_memory=False)
document = df['Data']
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(document)
matrix_terms = np.array(vectorizer.get_feature_names())
matrix_freq = np.asarray(X.sum(axis=0)).ravel()
terms = vectorizer.get_feature_names()
freqs = X.sum(axis=0).A1
dictionary = dict(zip(terms, freqs))
df = pd.DataFrame(dictionary,index=[0]).T.reindex()
df.to_csv("F:/textclustering/data/terms2.csv", sep=',', na_rep="none")
Input CSV:
text, id
example text is great, 1
this is great, 2
example text is great, 3
Desired output (or something close to it):
ngram, count, id
example text, 2, [1,3]
text is, 2, [1,3]
is great, 3, [1,2,3]
this is, 1, [1]
Answer 0 (score: 1)
First, we transform the documents into a CSR sparse matrix with CountVectorizer, then convert it to COO format. A COO matrix lets you read off the row and column position of every non-zero element.
from itertools import groupby
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
ls = [['example text is great', 1],
      ['this is great', 2],
      ['example text is great', 3]]
document = [l[0] for l in ls]
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(document)
X = X.tocoo()
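As a quick sanity check on what the COO format gives you for the toy data above: `X.row` holds the document index and `X.col` the vocabulary index of every non-zero entry. A minimal sketch (the names `docs` and `vec` are just illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['example text is great', 'this is great', 'example text is great']
vec = CountVectorizer(ngram_range=(2, 2))
X = vec.fit_transform(docs).tocoo()

# vocabulary_ maps each bigram to its column index (alphabetical order)
print(vec.vocabulary_)

# Each (row, col) pair is one occurrence of a bigram in a document:
# row = document index, col = index into the vocabulary.
for r, c in zip(X.row, X.col):
    print(r, c)
```

Every pair printed here is one "this bigram occurs in this document" fact, which is exactly what the grouping step below aggregates.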
Then you can group by column (i.e. by each bigram you have). There is one small trick here: you have to sort the (column, row) tuples by column first. Then, for each group, you can replace the column index with its bigram; I build a reverse dictionary called id2vocab for that.
output = []
id2vocab = dict((v,k) for k,v in vectorizer.vocabulary_.items())
zip_rc = sorted(zip(X.col, X.row), key=lambda x: x[0]) # group by column (vocab)
count = np.ravel(X.sum(axis=0)) # simple sum column for count
for g in groupby(zip_rc, key=lambda x: x[0]):
    index = g[0]
    bigram = id2vocab[index]
    loc = [g_[1] for g_ in g[1]]
    c = count[index]
    output.append([index, bigram, c, loc])
The output will look like this:
[[0, 'example text', 2, [0, 2]],
[1, 'is great', 3, [0, 1, 2]],
[2, 'text is', 2, [0, 2]],
[3, 'this is', 1, [1]]]
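One caveat: the locations in that output are 0-based row indices into the document list, not the original conversation ids from the CSV. As a follow-up sketch (assuming the ids sit in the second element of each ls entry, as above; the output filename terms_with_ids.csv is just a placeholder), you can map the row indices back to ids and write the ngram, count, id table the question asked for:

```python
from itertools import groupby

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

ls = [['example text is great', 1],
      ['this is great', 2],
      ['example text is great', 3]]
docs = [l[0] for l in ls]
ids = [l[1] for l in ls]  # conversation id per document row

vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(docs).tocoo()
id2vocab = {v: k for k, v in vectorizer.vocabulary_.items()}
count = np.ravel(X.sum(axis=0))  # total occurrences per bigram

rows = []
# Sort (col, row) pairs by column so groupby collects one bigram at a time,
# then translate each document row index into its conversation id.
for col, grp in groupby(sorted(zip(X.col, X.row)), key=lambda x: x[0]):
    locs = [r for _, r in grp]
    rows.append({'ngram': id2vocab[col],
                 'count': int(count[col]),
                 'id': [ids[r] for r in locs]})

result = pd.DataFrame(rows, columns=['ngram', 'count', 'id'])
result.to_csv('terms_with_ids.csv', index=False)
print(result)
```

For the toy data this yields rows like ('is great', 3, [1, 2, 3]), matching the desired output in the question.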