I have two documents, doc1.txt and doc2.txt. Their contents are:
#doc1.txt
very good, very bad, you are great
#doc2.txt
very bad, good restaurent, nice place to visit
I want to split my corpus on ', ' (comma followed by a space), so that my final DocumentTermMatrix becomes:
      terms
docs   very good  very bad  you are great  good restaurent  nice place to visit
doc1   tf-idf     tf-idf    tf-idf         0                0
doc2   0          tf-idf    0              tf-idf           tf-idf
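To make the splitting rule concrete, plain str.split on ', ' turns each line into exactly the phrase terms shown in the matrix above:

```python
doc1 = 'very good, very bad, you are great'
doc2 = 'very bad, good restaurent, nice place to visit'

# Each comma-separated chunk becomes one term.
print(doc1.split(', '))  # ['very good', 'very bad', 'you are great']
print(doc2.split(', '))  # ['very bad', 'good restaurent', 'nice place to visit']
```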
I know how to compute a DocumentTermMatrix for single words (using http://scikit-learn.org/stable/modules/feature_extraction.html), but I don't know how to compute a DocumentTermMatrix for these comma-separated phrases in Python.
Answer 0 (score: 5)
You can pass a function as the analyzer parameter of TfidfVectorizer to extract features in a custom way:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['very good, very bad, you are great',
        'very bad, good restaurent, nice place to visit']

# A callable analyzer replaces the default tokenizer entirely,
# so each document is split only on ', '.
tfidf = TfidfVectorizer(analyzer=lambda d: d.split(', ')).fit(docs)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(list(tfidf.get_feature_names_out()))
The resulting features are:
['good restaurent', 'nice place to visit', 'very bad', 'very good', 'you are great']
If you genuinely cannot load all of your data into memory at once, here is a workaround:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['doc1.txt', 'doc2.txt']

def extract(filename):
    # Read one file at a time and split each line on ', '.
    features = []
    with open(filename) as f:
        for line in f:
            features += line.strip().split(', ')
    return features

tfidf = TfidfVectorizer(analyzer=extract).fit(docs)
print(list(tfidf.get_feature_names_out()))
This loads one document at a time, instead of holding the whole corpus in memory at once.