Finding document frequency with Python

Date: 2016-02-04 21:51:35

Tags: python python-2.7

Hey everyone, I know this has been asked here a few times before, but I'm having a hard time finding document frequency with Python. I'm trying to compute TF-IDF and then the cosine scores between the documents and a query, but I'm stuck on finding the document frequency. This is what I have so far:

#includes
import re
import os
import operator
import glob
import sys
import math
from collections import Counter

#number of command line argument checker
if len(sys.argv) != 3:
    print 'usage: ./part3_soln2.py "path to folder in quotation marks" query.txt'
    sys.exit(1)

#Read in the directory to the files
path = sys.argv[1]

#Read in the query
y = sys.argv[2]
querystart = re.findall(r'\w+', open(y).read().lower())
query = [Z for Z in querystart]
Query_vec = Counter(query)
print Query_vec

#counts total number of documents in the directory
doccounter = len(glob.glob1(path,"*.txt"))

if os.path.exists(path) and os.path.isfile(y):
    word_TF = []
    word_IDF = {}
    TFvec = []
    IDFvec = []

    #this is my attempt at finding IDF
    for filename in glob.glob(os.path.join(path, '*.txt')):

        words_IDF = re.findall(r'\w+', open(filename).read().lower())

        doc_IDF = [A for A in words_IDF if len(A) >= 3 and A.isalpha()]

        word_IDF = doc_IDF

        #pseudocode!!
        """
        for key in word_idf:
            if key in word_idf:
                word_idf[key] += 1
            else:
                word_idf[key] = 1

        print word_idf
        """

    #goes to that directory and reads in the files there
    for filename in glob.glob(os.path.join(path, '*.txt')):

        words_TF = re.findall(r'\w+', open(filename).read().lower())

        #scans each document for words greater or equal to 3 in length
        doc_TF = [A for A in words_TF if len(A) >= 3 and A.isalpha()]

        #this assigns values to each term this is my TF for each vector
        TFvec = Counter(doc_TF)

        #weighing the Tf with a log function
        for key in TFvec: 
            TFvec[key] = 1 + math.log10(TFvec[key])


    #placed here so I don't get a command line full of text
    print TFvec 

#Error checker
else:
    print "That path does not exist"

I'm using Python 2, and so far I really don't know how to count the number of documents a term appears in. I can find the total number of documents, but I'm stuck on counting how many documents each term occurs in. I was just going to build one large dictionary holding every term from all the documents, which could then be looked up later when a query needs those terms. Thanks for any help you can give me.

1 Answer:

Answer 0 (score: 2)

The DF of a term x is the number of documents in which x appears. To find it, you need to iterate over all the documents first. Only then can you compute the IDF from the DF.
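To make that definition concrete, here is a small self-contained sketch using toy in-memory documents (the corpus and words are my own example data, not from the question):

```python
import math
from collections import defaultdict

# Toy corpus: three "documents" as word lists (hypothetical example data).
docs = [
    ["cat", "dog", "cat"],
    ["dog", "bird"],
    ["cat", "fish"],
]

# DF: the number of documents a term appears in.
# set() removes repetitions within a single document.
DF = defaultdict(int)
for doc in docs:
    for word in set(doc):
        DF[word] += 1

print(DF["cat"])  # "cat" occurs in documents 1 and 3 -> 2
print(DF["dog"])  # "dog" occurs in documents 1 and 2 -> 2

# IDF follows from DF: log(N / DF), where N is the total document count.
N = len(docs)
idf_bird = math.log(N / float(DF["bird"]))  # log(3 / 1)
```

Note that "cat" gets a DF of 2 even though it occurs three times in total: DF counts documents, not occurrences.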

You can use a dictionary to count the DF:

  1. Iterate over all the documents.
  2. For each document, retrieve its set of words (without repetitions).
  3. Increase the DF count for each word from stage 2. Thus you increase the count by exactly 1, regardless of how many times the word appeared in the document.
  4. The Python code could look like this:

    from collections import defaultdict
    import math
    
    DF = defaultdict(int) 
    for filename in glob.glob(os.path.join(path, '*.txt')):
        words = re.findall(r'\w+', open(filename).read().lower())
        for word in set(words):
            if len(word) >= 3 and word.isalpha():
                DF[word] += 1  # defaultdict simplifies your "if key in word_idf: ..." part.
    
    # Now you can compute IDF.
    IDF = dict()
    for word in DF:
        IDF[word] = math.log(doccounter / float(DF[word])) # Don't forget that python2 uses integer division.
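
The question's end goal is a cosine score between each document's TF-IDF vector and the query. That step is not covered above, but treating the `Counter`/dict vectors as sparse vectors, a minimal sketch could look like this (the weights below are made-up example values, not real TF-IDF output):

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Dot product over the terms the two vectors share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a if t in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights for one document and one query.
doc_vec = {"cat": 1.2, "dog": 0.8}
query_vec = {"cat": 1.0}

score = cosine(doc_vec, query_vec)
```

Using dicts as sparse vectors means you only iterate over terms that actually occur, instead of over the whole vocabulary.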
    

P.S. It helps to learn by implementing this manually, but if you get stuck, I suggest you look at the NLTK package. It provides useful functions for working with corpora (collections of texts).