What is an ideal tfidf matrix?

Time: 2017-02-27 15:31:10

Tags: python-3.x machine-learning tf-idf

When I run tfidf over a set of documents, it returns a tfidf matrix that looks like this:

  (1, 12)   0.656240233446
  (1, 11)   0.754552023393
  (2, 6)    1.0
  (3, 13)   1.0
  (4, 2)    1.0
  (7, 9)    1.0
  (9, 4)    0.742540927053
  (9, 5)    0.66980069547
  (11, 19)  0.735138466738
  (11, 7)   0.677916982176
  (12, 18)  1.0
  (13, 14)  0.697455191865
  (13, 11)  0.716628394177
  (14, 5)   1.0
  (15, 8)   1.0
  (16, 17)  1.0
  (18, 1)   1.0
  (19, 17)  1.0
  (22, 13)  1.0
  (23, 3)   1.0
  (25, 6)   1.0
  (26, 19)  0.476648253537
  (26, 7)   0.879094103268
  (28, 10)  0.532672175403
  (28, 7)   0.523456282204

I would like to know what this is; I cannot understand how it is laid out. When I was in debug mode I came across indices, indptr and data, which seem to be associated with the values shown. What are these? There is a lot of confusion in the numbers: if I assume the first element inside the brackets is the document, then I do not see the 0th, 5th or 6th documents, and so on. Please help me figure out how this works. I already know how tfidf works in general from the wiki (log of inverse document frequency and so on); I just want to know what these three different kinds of numbers are and what they refer to.
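For reference, these are the attributes I was looking at in debug mode (a minimal sketch, assuming the tfidf_matrix returned by the code below):

# tfidf_matrix is a scipy sparse CSR matrix; data/indices/indptr are its internal arrays
print(type(tfidf_matrix))        # a scipy.sparse CSR matrix
print(tfidf_matrix.data[:5])     # the non-zero tfidf values themselves
print(tfidf_matrix.indices[:5])  # column (term) index of each value in data
print(tfidf_matrix.indptr[:5])   # where each row (document) starts inside data/indices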

The source code is:

# Imports needed by the code below; FileAccess, _stemmer, _path, _report,
# _totalvocab_stemmed and _totalvocab_tokenized are defined elsewhere in the project.
import re
import nltk
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.externals import joblib   # `import joblib` on newer scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# This contains the list of file names
_filenames = []
# This contains the list of contents/text in the files
_contents = []
# This is a dict of filename:content
_file_contents = {}
class KmeansClustering():
    def kmeansClusters(self):
        global _report
        self.num_clusters = 5
        km = KMeans(n_clusters=self.num_clusters)
        vocab_frame = TokenizingAndPanda().createPandaVocabFrame()
        self.tfidf_matrix, self.terms, self.dist = TfidfProcessing().getTfidFPropertyData()
        km.fit(self.tfidf_matrix)
        self.clusters = km.labels_.tolist()
        joblib.dump(km, 'doc_cluster2.pkl')
        km = joblib.load('doc_cluster2.pkl')

class TokenizingAndPanda():

    def tokenize_only(self,text):
        '''
        This function tokenizes the text
        :param text: Give the text that you want to tokenize
        :return: the filtered tokens
        '''
        # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        return filtered_tokens

    def tokenize_and_stem(self,text):
        # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        stems = [_stemmer.stem(t) for t in filtered_tokens]
        return stems

    def getFilnames(self):
        '''
        Reads all file names from the global _path into the global _filenames list.
        :return: None
        '''
        global _path
        global _filenames
        path = _path
        _filenames = FileAccess().read_all_file_names(path)


    def getContentsForFilenames(self):
        global _contents
        global _file_contents
        for filename in _filenames:
            content = FileAccess().read_the_contents_from_files(_path, filename)
            _contents.append(content)
            _file_contents[filename] = content

    def createPandaVocabFrame(self):
        global _totalvocab_stemmed
        global _totalvocab_tokenized
        #Enable this if you want to load the filenames and contents from a file structure.
        # self.getFilnames()
        # self.getContentsForFilenames()

        # for name, i in _file_contents.items():
        #     print(name)
        #     print(i)
        for i in _contents:
            allwords_stemmed = self.tokenize_and_stem(i)
            _totalvocab_stemmed.extend(allwords_stemmed)

            allwords_tokenized = self.tokenize_only(i)
            _totalvocab_tokenized.extend(allwords_tokenized)
        vocab_frame = pd.DataFrame({'words': _totalvocab_tokenized}, index=_totalvocab_stemmed)
        print(vocab_frame)
        return vocab_frame


class TfidfProcessing():

    def getTfidFPropertyData(self):
        tfidf_vectorizer = TfidfVectorizer(max_df=0.4, max_features=200000,
                                           min_df=0.02, stop_words='english',
                                           use_idf=True, tokenizer=TokenizingAndPanda().tokenize_and_stem, ngram_range=(1, 1))
        # print(_contents)
        tfidf_matrix = tfidf_vectorizer.fit_transform(_contents)
        terms = tfidf_vectorizer.get_feature_names()
        dist = 1 - cosine_similarity(tfidf_matrix)

        return tfidf_matrix, terms, dist

1 Answer:

Answer 0 (score: 1)

The result of applying tfidf to data is usually a 2D matrix A, where A_ij is the normalized frequency of the j-th term (word) in the i-th document. What you see in your output is a sparse representation of this matrix; in other words, only the non-zero elements are printed out, so:

(1, 12) 0.656240233446

means that the word with index 12 (according to some vocabulary built by sklearn; indices are 0-based) has a normalized frequency of 0.656240233446 in the document with index 1. The "missing" entries are zero, meaning for example that the word with index 3 was not found in that document (since there is no (1, 3) entry), and so on.
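A minimal sketch of how to read those triples back as words, assuming you keep a reference to both the tfidf_matrix and the tfidf_vectorizer built in getTfidFPropertyData:

coo = tfidf_matrix.tocoo()                     # COO format exposes explicit (row, col, value) triples
terms = tfidf_vectorizer.get_feature_names()   # position j in this list is the word with index j
for i, j, v in zip(coo.row, coo.col, coo.data):
    print("document %d, term %r: %.4f" % (i, terms[j], v))

dense = tfidf_matrix.toarray()                 # the full 2D matrix A, mostly zeros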

The fact that some documents are missing is a consequence of your particular code/data (which you have not included); perhaps you set the vocabulary manually, or capped the maximum number of features? There are many parameters in TfidfVectorizer that can cause this, and without your exact code (and some sample data) nothing more can be said. For example, setting min_df can cause it (since it drops very rare words), and similarly max_features (same effect).
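A small illustration of that effect with made-up toy documents (not your data): with min_df=2 every term that occurs in only one document is dropped, so the third document below becomes an all-zero row and never appears in the sparse printout.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana apple", "banana cherry", "unique words only here"]
vec = TfidfVectorizer(min_df=2)     # keep only terms present in at least 2 documents
X = vec.fit_transform(docs)
print(vec.get_feature_names())      # ['banana'] -- everything else was too rare
print(X)                            # only (0, 0) and (1, 0) entries; document 2 is all zeros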