Suppose I have a dataframe in pandas with two columns, similar to the following:
   text                              label
0  This restaurant was amazing       Positive
1  The food was served cold          Negative
2  The waiter was a bit rude         Negative
3  I love the view from its balcony  Positive
I then use sklearn's TfidfVectorizer on this dataset. What is the most efficient way to find the top n terms per class, ranked by TF-IDF score?
Obviously, my actual dataframe has many more rows than the 4 above. The point of my post is to find code that works for any dataframe resembling the one above, whether it has 4 rows or 1M rows.
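For concreteness, a minimal sketch of the setup I have in mind (the dataframe mirrors the sample above; the vectorizer uses default settings):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# the sample dataframe from above
df = pd.DataFrame({
    "text": ["This restaurant was amazing",
             "The food was served cold",
             "The waiter was a bit rude",
             "I love the view from its balcony"],
    "label": ["Positive", "Negative", "Negative", "Positive"],
})

# fit TF-IDF on the text column; X is a sparse documents-terms matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"])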
I think my post is closely related to the following posts:
Answer 0 (score: 1)
Below you can find a piece of code I wrote for a similar purpose about three years ago. I'm not sure whether it is the most efficient way to do what you want, but as far as I remember, it worked for me.
import numpy as np
import operator

# X: data points (sparse documents-terms TF-IDF matrix)
# y: targets (each data point's label)
# vectorizer: TF-IDF vectorizer created by sklearn
# n: number of features we want to list for each class
# target_list: the list of all unique labels (for example, in my case I have two labels, 1 and -1, so target_list = [1, -1])
# --------------------------------------------
# splitting X vectors based on target classes
for label in target_list:
    # listing the most important words in each class
    indices = []
    current_dict = {}
    # finding the indices of the rows (data points) belonging to the current class
    for i in range(X.shape[0]):
        if y[i] == label:
            indices.append(i)
    # get the rows of the current class from the tf-idf matrix and compute the mean of each feature's values
    vectors = np.mean(X[indices, :], axis=0)
    # creating a dictionary that maps each feature index to its mean value
    for i in range(X.shape[1]):
        current_dict[i] = vectors.item((0, i))
    # sorting the dictionary by value, in descending order
    sorted_dict = sorted(current_dict.items(), key=operator.itemgetter(1), reverse=True)
    # printing the features' textual and numeric values
    index = 1
    for element in sorted_dict:
        for key_, value_ in vectorizer.vocabulary_.items():
            if element[0] == value_:
                print(str(index) + "\t" + str(key_) + "\t" + str(element[1]))
                index += 1
                if index > n:
                    break
        else:
            continue
        break
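As a side note, the inner scan over vectorizer.vocabulary_ can be avoided entirely. A sketch of a vectorized variant (assuming the same X, y, n and target_list as above, and scikit-learn >= 1.0 for get_feature_names_out):

import numpy as np

# rank features by their mean TF-IDF within each class
feature_names = np.array(vectorizer.get_feature_names_out())
y = np.asarray(y)
for label in target_list:
    # mean TF-IDF of every feature over the rows of this class
    means = np.asarray(X[y == label].mean(axis=0)).ravel()
    top = np.argsort(means)[::-1][:n]
    for rank, j in enumerate(top, start=1):
        print(str(rank) + "\t" + feature_names[j] + "\t" + str(means[j]))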
Answer 1 (score: 0)
import pandas as pd

# term_doc_mat: a documents-terms dataframe (rows = documents, columns = terms)
top_terms = pd.DataFrame(columns=range(1, 6))
for i in term_doc_mat.index:
    top_terms.loc[len(top_terms)] = term_doc_mat.loc[i].sort_values(ascending=False)[0:5].index
This will give you the top 5 terms for each document. Adjust as needed.
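A sketch of how the term_doc_mat this answer assumes might be built, together with the per-class adaptation it alludes to (assuming the df, vectorizer and X from the question's setup):

# documents-terms dataframe: rows = documents, columns = terms
term_doc_mat = pd.DataFrame(X.toarray(),
                            columns=vectorizer.get_feature_names_out(),
                            index=df.index)

# per-class variant: average the rows of each class, then take the top 5 terms
for label in df["label"].unique():
    class_means = term_doc_mat[df["label"] == label].mean(axis=0)
    print(label, class_means.nlargest(5).index.tolist())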
Answer 2 (score: 0)
The following code does the job (thanks to Mariia Havrylovych). Assume we have an input dataframe df whose structure matches yours.
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# override scikit-learn's TfidfVectorizer so that it returns a dataframe
# with the feature names as columns
class DenseTfIdf(TfidfVectorizer):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # explicitly store the kwargs as attributes (sklearn expects init
        # params to be set on the instance)
        for k, v in kwargs.items():
            setattr(self, k, v)

    def transform(self, x, y=None) -> pd.DataFrame:
        res = super().transform(x)
        df = pd.DataFrame(res.toarray(), columns=self.get_feature_names_out(), index=x.index)
        return df

    def fit_transform(self, x, y=None) -> pd.DataFrame:
        # run sklearn's fit_transform, then convert the returned sparse
        # documents-terms matrix into a dataframe for further manipulation
        res = super().fit_transform(x, y=y)
        df = pd.DataFrame(res.toarray(), columns=self.get_feature_names_out(), index=x.index)
        return df
# assume the texts are stored in the column 'text' of the dataframe
texts = df['text']
df_docs_terms_corpus = DenseTfIdf(sublinear_tf=True,
                                  max_df=0.5,
                                  min_df=2,
                                  encoding='ascii',
                                  ngram_range=(1, 2),
                                  lowercase=True,
                                  max_features=1000,
                                  stop_words='english'
                                  ).fit_transform(texts)

# keep the indexes of the original dataframe and of the resulting
# documents-terms dataframe aligned
df_class = df[df["label"] == "Class XX"]
df_docs_terms_class = df_docs_terms_corpus.loc[df_class.index]

# sum over the columns and take the top n keywords
df_docs_terms_class.sum(axis=0).nlargest(n=50)
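To cover every class rather than a single hard-coded label, a small extension of the same idea (a sketch; the choice of n=10 is arbitrary):

# top keywords per class, reusing the index alignment established above
for label in df["label"].unique():
    class_index = df.index[df["label"] == label]
    top_terms = df_docs_terms_corpus.loc[class_index].sum(axis=0).nlargest(n=10)
    print(label, top_terms.index.tolist())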