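I am using document similarity as defined here.
My question is how to get the most relevant documents from the resulting numpy.ndarray. Is there a way to sort the NumPy array and get the top-K most similar documents?
Here is the sample code.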
from sklearn.feature_extraction.text import TfidfVectorizer
poem = ["All the world's a stage",
"And all the men and women merely players",
"They have their exits and their entrances",
"And one man in his time plays many parts",
"His acts being seven ages. At first, the infant",
"Mewling and puking in the nurse's arms",
"And then the whining school-boy, with his satchel",
"And shining morning face, creeping like snail",
"Unwillingly to school. And then the lover",
"Sighing like furnace, with a woeful ballad",
"Made to his mistress' eyebrow. Then a soldier",
"Full of strange oaths and bearded like the pard",
"Jealous in honour, sudden and quick in quarrel",
"Seeking the bubble reputation",
"Even in the cannon's mouth. And then the justice",
"In fair round belly with good capon lined",
"With eyes severe and beard of formal cut",
"Full of wise saws and modern instances",
"And so he plays his part. The sixth age shifts",
"Into the lean and slipper'd pantaloon",
"With spectacles on nose and pouch on side",
"His youthful hose, well saved, a world too wide",
"For his shrunk shank; and his big manly voice",
"Turning again toward childish treble, pipes",
"And whistles in his sound. Last scene of all",
"That ends this strange eventful history",
"Is second childishness and mere oblivion",
"Sans teeth, sans eyes, sans taste, sans everything"]
vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(poem)
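# Pairwise similarity as a dense array; .toarray() replaces the .A shorthand, which newer SciPy releases no longer provide
result = (tfidf @ tfidf.T).toarray()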
print(type(result))
print(result)
Answer 0 (score: 1)
Set the diagonal elements to zero, then use argsort() on the flattened array to find the top-K indices, and use unravel_index() to convert the 1D indices into 2D indices:
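import numpy as np
# Zero the diagonal so a document is never matched with itself
result[np.diag_indices_from(result)] = 0.0
# Flat indices of the 10 largest values, in ascending order of similarity
idx = np.argsort(result, axis=None)[-10:]
# Convert the flat indices back into (row, column) index arrays
midx = np.unravel_index(idx, result.shape)
print(midx)
print(result[midx])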
Result:
(array([ 8, 14, 1, 0, 11, 17, 8, 10, 6, 8]), array([14, 8, 0, 1, 17, 11, 10, 8, 8, 6]))
[0.2329741 0.2329741 0.2379527 0.2379527 0.25723394 0.25723394 0.26570327 0.26570327 0.34954834 0.34954834]
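Because the similarity matrix is symmetric, the flattened top-K lists every pair twice, once as (i, j) and once as (j, i). The sketch below shows two possible variations, not part of the original answer: ranking only the upper triangle so each pair appears once, and ranking a single row to get the documents closest to one given document. The helper names `top_k_pairs` and `top_k_for_doc` and the small 4x4 matrix are made up for illustration. In practice `sim` would be the `result` array from the question.

```python
import numpy as np

def top_k_pairs(sim, k):
    """Return the k most similar (i, j, score) triples with i < j, best first."""
    # Upper triangle without the diagonal: each unordered pair appears once.
    rows, cols = np.triu_indices_from(sim, k=1)
    scores = sim[rows, cols]
    order = np.argsort(scores)[::-1][:k]
    return [(int(rows[o]), int(cols[o]), float(scores[o])) for o in order]

def top_k_for_doc(sim, doc, k):
    """Return the k (index, score) pairs most similar to document `doc`, best first."""
    scores = sim[doc].astype(float)  # astype returns a copy, so sim is untouched
    scores[doc] = -np.inf            # never return the document itself
    order = np.argsort(scores)[::-1][:k]
    return [(int(j), float(scores[j])) for j in order]

# Hypothetical similarity matrix for four documents (illustrative values only).
sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.5, 0.2],
    [0.1, 0.5, 1.0, 0.6],
    [0.3, 0.2, 0.6, 1.0],
])

pairs = top_k_pairs(sim, 2)            # two most similar distinct pairs
neighbours = top_k_for_doc(sim, 2, 2)  # two documents closest to document 2
print(pairs)
print(neighbours)
```

For a very large matrix, np.argpartition could replace the full argsort when only the top K entries are needed, at the cost of sorting those K entries afterwards.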