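I have some code that uses TextBlob to index words. My current output comes from a for loop over each doc (doc1, doc2, doc3, and so on).
From each document I want a vector of the 4 most important words, with their index numbers returned as a (4, 1) np.array. Unfortunately, I can't seem to work this out.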
bloblist = [doc1, doc2, doc3, doc4, doc5, doc6, doc7, doc8, doc9, doc10, doc11]
for i, blob in enumerate(bloblist):
    print("Top words in doc {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    corpus = blob.words
    wordIndex = list(enumerate(corpus))
    for word, score in sorted_words[:4]:
        arr = (corpus.index(word))
        print(arr)
This produces the following result:
Top words in doc 1
5
0
1
2
Top words in doc 2
19
12
41
4
Which is cool, but I would like it to look like this:
Top words in doc 1
[5,0,1,2]
Can anyone help me?
Answer 0 (score: 1)
Thanks to Oli, I found a solution that works for me.
import numpy as np

bloblist = [doc1, doc2, doc3, doc4, doc5, doc6, doc7, doc8, doc9, doc10, doc11]
for i, blob in enumerate(bloblist):
    print("Top words in doc {}".format(i + 1))
    scores = {word: tfidf(word, blob, bloblist) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    corpus = blob.words
    wordIndex = list(enumerate(corpus))
    # np.array([]) is float64 by default, so the indices are stored as floats
    arr = np.array([])
    for word, score in sorted_words[:4]:
        # position of the first occurrence of the word in the document
        arrw = np.array([corpus.index(word)])
        arr = np.concatenate((arr, arrw))
    print(arr)
    arr = arr.reshape(4, 1)
    print(arr.shape)
This gives the desired output:
Top words in doc 1
[ 5. 0. 1. 2.]
(4, 1)
Top words in doc 2
[ 19. 12. 41. 4.]
(4, 1)
Top words in doc 3
[ 16. 2. 6. 7.]
(4, 1)
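The indices above print as floats because the array starts out as an empty float array. As a rough sketch of an alternative, the indices could be collected in a list and converted to an integer array in one step. Everything below is an assumption for illustration: the documents are plain word lists rather than TextBlob objects, the tf/idf helpers are a common textbook formulation and not necessarily the `tfidf` used in the question, and `top_word_indices` is a hypothetical helper name.

```python
import math
import numpy as np

# Assumed stand-ins for the question's tfidf(); documents are plain word lists here.
def tf(word, doc):
    return doc.count(word) / len(doc)

def n_containing(word, docs):
    return sum(1 for doc in docs if word in doc)

def idf(word, docs):
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

def top_word_indices(doc, docs, k=4):
    """Hypothetical helper: (k, 1) integer array of first-occurrence
    positions of the k highest-scoring words in doc."""
    scores = {word: tfidf(word, doc, docs) for word in doc}
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    # Build the whole array at once; dtype=int keeps the indices as integers.
    return np.array([doc.index(word) for word in ranked], dtype=int).reshape(-1, 1)

# Made-up sample documents, only to make the sketch runnable.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat around the yard".split(),
    "birds sing in the early morning light".split(),
]

for i, doc in enumerate(docs):
    idx = top_word_indices(doc, docs)
    print("Top words in doc {}".format(i + 1))
    print(idx.ravel().tolist(), idx.shape)
```

Using `reshape(-1, 1)` instead of `reshape(4, 1)` means a document with fewer than 4 distinct words would not raise an error; it would simply yield a shorter column.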