Finding document frequency with Python

Date: 2016-02-04 21:51:35

Tags: python python-2.7

Hey everyone, I know this has been asked here a few times before, but I'm having a hard time finding document frequency with Python. I'm trying to compute TF-IDF and then the cosine scores between the documents and a query, but I'm stuck on finding the document frequency. This is what I have so far:

#includes
import re
import os
import operator
import glob
import sys
import math
from collections import Counter

#number of command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)

#Read in the directory to the files
path = sys.argv[1]

#Read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec

#counts total number of documents in the directory
doccounter = len(glob.glob1(path,"*.txt"))

if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    #this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):

        words_IDF = re.findall(r'\w+', open(filename).read().lower())

        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]

        word_IDF = doc_IDF

        #pseudocode!!
        """
        for key in word_idf:
            if key in word_idf:
                word_idf[key] += 1
            else:
                word_idf[key] = 1

        print word_idf
        """

    #goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):

        words_TF = re.findall(r'\w+', open(filename).read().lower())

        #scans each document for words greater or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]

        #this assigns values to each term this is my TF for each vector
        TFvec = Counter(doc_TF)

        #weighing the Tf with a log function
        for key in TFvec: 
            TFvec[key] = 1 + math.log10(TFvec[key])


    #placed here so I don't get a command line full of text
    print TFvec 

#Error checker
else:
    print "That path does not exist"

I'm using Python 2, and so far I really don't know how to count the number of documents a term appears in. I can find the total number of documents, but I'm stuck on counting how many documents each term occurs in. I was just going to build one large dictionary holding every term from all the documents, which could then be looked up later when a query needs those terms. Thanks for any help you can give me.

1 Answer:

Answer 0 (score: 2)

The DF of a term x is the number of documents in which x appears. To find it, you need to iterate over all the documents first. Only then can you compute the IDF from the DF.
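To make that definition concrete, here is a small self-contained sketch using toy in-memory documents (the corpus and words are my own example data, not from the question):

```python
import math
from collections import defaultdict

# Toy corpus: three "documents" as word lists (hypothetical example data).
docs = [
    ["cat", "dog", "cat"],
    ["dog", "bird"],
    ["cat", "fish"],
]

# DF: the number of documents a term appears in.
# set() removes repetitions within a single document.
DF = defaultdict(int)
for doc in docs:
    for word in set(doc):
        DF[word] += 1

print(DF["cat"])  # "cat" occurs in documents 1 and 3 -> 2
print(DF["dog"])  # "dog" occurs in documents 1 and 2 -> 2

# IDF follows from DF: log(N / DF), where N is the total document count.
N = len(docs)
idf_bird = math.log(N / float(DF["bird"]))  # log(3 / 1)
```

Note that "cat" gets a DF of 2 even though it occurs three times in total: DF counts documents, not occurrences.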

You can use a dictionary to count the DF:

  1. Iterate over all the documents.
  2. For each document, retrieve its set of words (without repetitions).
  3. Increase the DF count for each word from stage 2. Thus you increase the count by exactly 1, regardless of how many times the word appeared in the document.
  4. The Python code could look like this:

    from collections import defaultdict
    import math
    
    DF = defaultdict(int) 
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words = re.findall(r'\w+', open(filename).read().lower())
        for word in set(words):
            if len(word) >= 3 and word.isalpha():
                DF[word] += 1  # defaultdict simplifies your "if key in word_idf: ..." part.
    
    # Now you can compute IDF.
    IDF = dict()
    for word in DF:
        IDF[word] = math.log(doccounter / float(DF[word])) # Don't forget that python2 uses integer division.
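
The question's end goal is a cosine score between each document's TF-IDF vector and the query. That step is not covered above, but treating the `Counter`/dict vectors as sparse vectors, a minimal sketch could look like this (the weights below are made-up example values, not real TF-IDF output):

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Dot product over the terms the two vectors share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights for one document and one query.
doc_vec = {"cat": 1.2, "dog": 0.8}
query_vec = {"cat": 1.0}

score = cosine(doc_vec, query_vec)
```

Using dicts as sparse vectors means you only iterate over terms that actually occur, instead of over the whole vocabulary.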
    

P.S. It helps to learn by implementing this manually, but if you get stuck, I suggest you look at the NLTK package. It provides useful functions for working with corpora (collections of texts).