Replace index values with actual file names

Time: 2016-06-20 21:06:37

Tags: python nlp scikit-learn nltk

I have the following code:

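import os
from sklearn.feature_extraction.text import TfidfVectorizer

path = 'some path'

# Collect every .txt file under path (recursively).
textIN = [os.path.join(dirpath, f)
          for dirpath, dirnames, files in os.walk(path)
          for f in files if f.endswith('.txt')]
# Open each file; TfidfVectorizer(input='file') reads from these file objects.
docs = [open(f) for f in textIN]
print(textIN)
# Rows are L2-normalized, so tfidf * tfidf.T holds pairwise cosine similarities.
tfidf = TfidfVectorizer(input='file', encoding='utf-8',
                        stop_words='english', norm='l2').fit_transform(docs)
pair = tfidf * tfidf.T
print(pair)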

It works fine and outputs the following:

[filename, filename, filename, etc.]    
 (0, 7)        0.0993597661923
 (0, 8)        0.0118954936469
 (0, 9)        0.147109830057
 (0, 10)       0.0791162122306
 (0, 6)        0.0433550484844
 (0, 5)        0.0892228473038
 (0, 4)        0.0356736412472
 (0, 3)        0.693555573615
 (0, 2)        0.0346887846227
 (0, 1)        0.0279157259462
 (0, 0)        1.0
 (1, 0)        0.0279157259462
 (1, 7)        0.0395168969129
 (1, 8)        0.0247319167695
 (1, 9)        0.110314319112
 (1, 5)        0.0348945360205
 (1, 4)        0.288812927116
 (1, 3)        0.0966845883594
 (1, 10)       0.153976266391
 (1, 6)        0.271902487932
 (1, 2)        0.100596627508
 (1, 1)        1.0
 (2, 0)        0.0346887846227
 (2, 1)        0.100596627508
 (2, 8)        0.0127731857591
 :     :
 (8, 6)        0.0380354696531

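I would like to replace the index values with the corresponding file names and display all of the results, not just the truncated listing.

Thanks for your help.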
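One possible approach, offered only as a sketch: because docs is built from textIN in the same order, row i and column j of pair should correspond to textIN[i] and textIN[j]. The helper below, label_pairs, is a hypothetical name and not part of scikit-learn or SciPy. It takes the row indices, column indices and values of the stored entries and swaps the indices for names. The file names and scores in the demo are made up for illustration.

```python
def label_pairs(rows, cols, values, names):
    """Yield (name_i, name_j, score) for each stored (i, j, score) entry."""
    for i, j, score in zip(rows, cols, values):
        yield names[i], names[j], score


# Illustrative data only: two hypothetical file names and three entries.
names = ['a.txt', 'b.txt']
rows = [0, 0, 1]
cols = [1, 0, 0]
values = [0.5, 1.0, 0.5]

for name_i, name_j, score in label_pairs(rows, cols, values, names):
    print('({}, {})\t{}'.format(name_i, name_j, score))

# With the variables from the question (assumes pair is a SciPy sparse matrix):
#     coo = pair.tocoo()
#     for a, b, s in label_pairs(coo.row, coo.col, coo.data, textIN):
#         print('({}, {})\t{}'.format(a, b, s))
```

Iterating over the entries this way prints every stored pair, whereas printing the sparse matrix directly abbreviates long output. If only the bare file names are wanted, os.path.basename could be applied to each entry of textIN before passing the list in.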

0 answers:

No answers