I have the following code:
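import os
from sklearn.feature_extraction.text import TfidfVectorizer

path = 'some path'
# Collect the path of every .txt file under `path`, recursively.
textIN = [os.path.join(dirpath, f)
          for dirpath, dirnames, files in os.walk(path)
          for f in files if f.endswith('.txt')]
# input='file' expects open file objects rather than paths.
docs = [open(f) for f in textIN]
print(textIN)
tfidf = TfidfVectorizer(input='file', encoding='utf-8',
                        stop_words='english', norm='l2').fit_transform(docs)
# Rows are L2-normalized, so this product holds pairwise cosine similarities.
pair = tfidf * tfidf.T
print(pair)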
It works fine and outputs the following:
[filename, filename, filename, etc.]
(0, 7) 0.0993597661923
(0, 8) 0.0118954936469
(0, 9) 0.147109830057
(0, 10) 0.0791162122306
(0, 6) 0.0433550484844
(0, 5) 0.0892228473038
(0, 4) 0.0356736412472
(0, 3) 0.693555573615
(0, 2) 0.0346887846227
(0, 1) 0.0279157259462
(0, 0) 1.0
(1, 0) 0.0279157259462
(1, 7) 0.0395168969129
(1, 8) 0.0247319167695
(1, 9) 0.110314319112
(1, 5) 0.0348945360205
(1, 4) 0.288812927116
(1, 3) 0.0966845883594
(1, 10) 0.153976266391
(1, 6) 0.271902487932
(1, 2) 0.100596627508
(1, 1) 1.0
(2, 0) 0.0346887846227
(2, 1) 0.100596627508
(2, 8) 0.0127731857591
: :
(8, 6) 0.0380354696531
I would like to replace the index values with the corresponding file names and display all of the results.
Thanks for your help.
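To make the goal concrete, here is a minimal sketch of the kind of mapping I am after. It assumes the row and column indices of `pair` follow the order of `textIN`, which should hold because the documents are passed to `fit_transform` in that order. The helper `label_pairs` and the sample paths and scores are hypothetical, made up only for illustration.

```python
import os


def label_pairs(triples, paths):
    """Turn (row, col, value) index triples into (name_a, name_b, value)."""
    # Use only the file name part of each path as the label.
    names = [os.path.basename(p) for p in paths]
    return [(names[i], names[j], float(v)) for i, j, v in triples]


# Stand-in data so the sketch runs on its own (hypothetical values).
# In the real script the triples would come from the sparse matrix:
#     coo = pair.tocoo()
#     triples = zip(coo.row, coo.col, coo.data)
paths = ['docs/a.txt', 'docs/b.txt']
triples = [(0, 0, 1.0), (0, 1, 0.25), (1, 0, 0.25), (1, 1, 1.0)]

# Print every labelled pair, one per line.
for name_a, name_b, score in label_pairs(triples, paths):
    print('%s\t%s\t%.4f' % (name_a, name_b, score))
```

In the real script, calling `label_pairs(triples, textIN)` would use the actual file list. Looping over the entries this way should also show every pair, instead of the shortened `: :` view that printing the matrix gives.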