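I am plotting a set of text documents in 2D, and I have noticed a few outliers that I would like to be able to identify. I start from the raw text and vectorize it with the TfidfVectorizer built into scikit-learn.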
vectorizer = TfidfVectorizer(max_df=0.5, max_features=None,
                             min_df=2, stop_words='english',
                             use_idf=True, lowercase=True)
corpus = make_corpus(root)
X = vectorizer.fit_transform(corpus)
To reduce this to 2D, I am using TruncatedSVD.
reduced_data = TruncatedSVD(n_components=2).fit_transform(X)
If I want to find which text document has the highest value on the second principal component (the y-axis), how would I do that?
Answer 0 (score: 2)
As I understand it, you want to know which document maximizes a particular principal component. Here is a toy example I came up with:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np
corpus = [
    'this is my first corpus',
    'this is my second corpus which is longer than the first',
    'here is yet another one, but it is brief',
    'and watch out for number four chuggin along',
    'blah blah blah my final sentence yada yada yada'
]
vectorizer = TfidfVectorizer(stop_words='english',
                             use_idf=True, lowercase=True)
# first get TFIDF matrix
X = vectorizer.fit_transform(corpus)
# second compress to two dimensions
svd = TruncatedSVD(n_components=2).fit(X)
reduced = svd.transform(X)
# now, find the doc with the highest 2nd principal component
corpus[np.argmax(reduced[:, 1])]
Which yields:
'and watch out for number four chuggin along'
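Since you mentioned several outliers rather than just one, the same idea can be extended to rank documents along a component instead of taking a single argmax. The following is only a sketch: the helper name rank_docs_by_component and its parameters are my own invention, not part of scikit-learn. It fixes random_state because TruncatedSVD uses a randomized solver by default, so results can otherwise differ between runs.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def rank_docs_by_component(docs, component=1, k=3, random_state=0):
    """Return (index, score) pairs for the k documents with the highest
    value on the given SVD component (hypothetical helper)."""
    # TF-IDF matrix, as in the toy example above
    X = TfidfVectorizer(stop_words='english').fit_transform(docs)
    # compress to two dimensions; random_state makes the run repeatable
    svd = TruncatedSVD(n_components=2, random_state=random_state)
    reduced = svd.fit_transform(X)
    scores = reduced[:, component]
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]


corpus = [
    'this is my first corpus',
    'this is my second corpus which is longer than the first',
    'here is yet another one, but it is brief',
    'and watch out for number four chuggin along',
    'blah blah blah my final sentence yada yada yada'
]

ranked = rank_docs_by_component(corpus, component=1, k=3)
for idx, score in ranked:
    print(idx, round(score, 3), corpus[idx])
```

Keep in mind that the direction of an SVD component has no intrinsic meaning, so documents at the low end of the axis (np.argmin, or the tail of the sorted order) may be just as much outliers as those at the high end.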