Python 3 dictionary for a weighted inverted index

Time: 2015-04-30 01:45:41

Tags: python python-3.x dictionary

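First off, this is homework, so I am only asking for some pointers. I am writing a program that generates a weighted inverted index. A weighted inverted index is a dictionary keyed by word; each value is a list of lists, where every inner list holds a document number and the number of times the word appears in that document.

For example,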

{"a": [[1, 2],[2,1]]}
The word "a" appears twice in document 1 and once in document 2.
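
To make the structure concrete, here is a minimal sketch of how such an index could be read back. The helper name `describe` is hypothetical and only serves the illustration.

```python
# Illustrative only: a tiny hand-written weighted inverted index.
index = {"a": [[1, 2], [2, 1]]}

def describe(index, word):
    """Return one readable line per [docNumber, count] entry for the word."""
    return [
        "{!r} appears {} time(s) in document {}".format(word, count, doc)
        for doc, count in index.get(word, [])
    ]

for line in describe(index, "a"):
    print(line)
```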

I am practicing with two small files.

FILE1.TXT:

    Where should I go
    When I want to have
    A smoke,
    A pancake, 
    and a nap.

FILE2.TXT:

    I do not know
    Where my pancake is
    I want to take a nap.

Here is my program code:

def cleanData(myFile):
    file = open(myFile, "r")

    data = file.read()
    wordList = []

    #All numbers and end-of-sentence punctuation
    #replaced with the empty string
    #No replacement of apostrophes
    formattedData = data.strip().lower().replace(",","")\
                 .replace(".","").replace("!","").replace("?","")\
                 .replace(";","").replace(":","").replace('"',"")\
                 .replace("1","").replace("2","").replace("3","")\
                 .replace("4","").replace("5","").replace("6","")\
                 .replace("7","").replace("8","").replace("9","")\
                 .replace("0","")

    words = formattedData.split() #creates a list of all words in the document
    for word in words:
        wordList.append(word)     #adds each word in a document to the word list
    return wordList

def main():

    fullDict = {}

    files = ["file1.txt", "file2.txt"]
    docNumber = 1

    for file in files:
        wordList = cleanData(file)

        for word in wordList:
            if word not in fullDict:
                fullDict[word] = []
                fileList = [docNumber, 1]
                fullDict[word].append(fileList)
            else:
                listOfValues = list(fullDict.values())
                for x in range(len(listOfValues)):
                    if docNumber == listOfValues[x][0]:
                        listOfValues[x][1] +=1
                        fullDict[word] = listOfValues
                        break
                fileList = [docNumber,1]
                fullDict[word].append(fileList)

        docNumber +=1
    return fullDict

What I want to produce is something like this:

{"a": [[1,3],[2,1]], "nap": [[1,1],[2,1]]}

What I get instead is:

{"a": [[1,1],[1,1],[1,1],[2,1]], "nap": [[1,1],[2,1]]}

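It records every occurrence of each word across all documents, but it records repeated occurrences as separate entries instead of adding them up. I can't figure this out. Any help would be appreciated! Thanks in advance. :)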

2 Answers:

Answer 0: (score: 2)

There are two main problems in your code.

Problem 1

        listOfValues = list(fullDict.values())
        for x in range(len(listOfValues)):
            if docNumber == listOfValues[x][0]:

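Here you fetch all the values of the dictionary, regardless of the current word, and try to increment a count in them. You should instead increment the count in the list that belongs to the current word. So you should change it to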

listOfValues = fullDict[word]
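
A small sketch of why the original lookup never matches. The dictionary contents below are an assumed intermediate state, made up for illustration.

```python
# Assumed state of the index after the first "a" of document 1 has been seen.
fullDict = {"where": [[1, 1]], "a": [[1, 1]]}
docNumber = 1

# Original lookup: one element per word in the dictionary,
# and each element is itself a list of [docNumber, count] entries.
listOfValues = list(fullDict.values())
first = listOfValues[0][0]           # a whole entry, [1, 1], not a document number
buggy_match = (docNumber == first)   # an int never equals a list

# Corrected lookup: only the entries of the current word.
entries = fullDict["a"]
fixed_match = (docNumber == entries[0][0])  # entries[0][0] is the document number

print(buggy_match, fixed_match)
```

Because the comparison in the original code is never true, the increment branch is never reached and every occurrence falls through to the `append`.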

Problem 2

        fileList = [docNumber,1]
        fullDict[word].append(fileList)

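Apart from incrementing counts across all words, you always append a new entry to fullDict[word]. You should append it only when docNumber is not already present in listOfValues. So you can attach an else clause to the for loop (it runs only when the loop finishes without hitting break), like this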

    for word in wordList:
        if word not in fullDict:
            ....
        else:
            listOfValues = fullDict[word]
            for x in range(len(listOfValues)):
                ....
            else:
                fileList = [docNumber, 1]
                fullDict[word].append(fileList)
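
The `for ... else` behaviour can be checked in isolation. This is only a sketch; the function name `add_occurrence` is hypothetical and not part of the original program.

```python
def add_occurrence(entries, docNumber):
    """Increment the count for docNumber, or append a new [docNumber, 1] entry."""
    for entry in entries:
        if entry[0] == docNumber:
            entry[1] += 1
            break              # found: the else block below is skipped
    else:
        # Reached only when the loop ended without break,
        # i.e. no entry for docNumber exists yet.
        entries.append([docNumber, 1])
    return entries

entries = [[1, 1]]
add_occurrence(entries, 1)   # existing document: count goes up
add_occurrence(entries, 2)   # new document: entry is appended
print(entries)
```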

After making these two changes, I get the following output

{'a': [[1, 3], [2, 1]],
 'and': [[1, 1]],
 'do': [[2, 1]],
 'go': [[1, 1]],
 'have': [[1, 1]],
 'i': [[1, 2], [2, 2]],
 'is': [[2, 1]],
 'know': [[2, 1]],
 'my': [[2, 1]],
 'nap': [[1, 1], [2, 1]],
 'not': [[2, 1]],
 'pancake': [[1, 1], [2, 1]],
 'should': [[1, 1]],
 'smoke': [[1, 1]],
 'take': [[2, 1]],
 'to': [[1, 1], [2, 1]],
 'want': [[1, 1], [2, 1]],
 'when': [[1, 1]],
 'where': [[1, 1], [2, 1]]}

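A few suggestions for improving the code.

  1. Instead of a list, you can use a dictionary to store the document numbers and counts. That will make your life easier.

  2. You can use collections.Counter instead of counting manually.

  3. Instead of chaining multiple replace calls, you can use a simple regular expression, for example

    formattedData = re.sub(r'[,.!?;:"0-9]', '', data.strip().lower())

  4. If I were to clean up cleanData, I would do it like this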

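    import re

    def cleanData(myFile):
        # The with block closes the file automatically
        with open(myFile, "r") as input_file:
            data = input_file.read()
        # Drop punctuation and digits (apostrophes are kept), then split into words
        return re.sub(r'[,.!?;:"0-9]', '', data.strip().lower()).split()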
    

    In the main loop, you can use the improvement suggested by Brad Budlong, like this

    def main():
        fullDict = {}
        files = ["file1.txt", "file2.txt"]
        for docNumber, currentFile in enumerate(files, 1):
            for word in cleanData(currentFile):
                if word not in fullDict:
                    fullDict[word] = [[docNumber, 1]]
                else:
                    for x in fullDict[word]:
                        if docNumber == x[0]:
                            x[1] += 1
                            break
                    else:
                        fullDict[word].append([docNumber, 1])
        return fullDict
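
The first two suggestions (a dictionary per word and collections.Counter) could be combined roughly as follows. This is a sketch under assumptions: the documents are passed in as already-cleaned word lists rather than read from files, and the names `build_index` and `to_list_form` are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(documents):
    """documents: list of word lists, one per document, numbered from 1."""
    index = defaultdict(dict)              # word -> {docNumber: count}
    for docNumber, words in enumerate(documents, 1):
        # Counter tallies every word of one document in a single pass.
        for word, count in Counter(words).items():
            index[word][docNumber] = count
    return dict(index)

def to_list_form(index):
    """Convert {word: {doc: count}} into {word: [[doc, count], ...]}."""
    return {word: [[doc, n] for doc, n in sorted(counts.items())]
            for word, counts in index.items()}

docs = [
    "where should i go when i want to have a smoke a pancake and a nap".split(),
    "i do not know where my pancake is i want to take a nap".split(),
]
index = build_index(docs)
print(index["a"])
print(to_list_form(index)["nap"])
```

With a dictionary per word, no inner search loop is needed; `to_list_form` is only there in case the assignment requires the list-of-lists layout.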
    

Answer 1: (score: 1)

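My preferred implementation of the for loop does not use the len and range functions to iterate. Since these are all mutable lists, you don't need to know the index; you only need each inner list itself, which you can then modify in place. I replaced the for loop with the following and got the same output as thefourtheye.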

for word in wordList:
    if word not in fullDict:
        fullDict[word] = [[docNumber, 1]]
    else:
        for val in fullDict[word]:
            if val[0] == docNumber:
                val[1] += 1
                break
        else:
            fullDict[word].append([docNumber, 1])
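
For anyone who wants to try the loop without creating files, here is a self-contained sketch. It assumes the two sample texts are supplied as strings, and the names `clean_text` and `build_index_from_texts` are hypothetical; the cleaning step uses a regular expression covering the same characters as the question's replace chain.

```python
import re

def clean_text(text):
    """Lowercase, drop punctuation and digits (apostrophes kept), split into words."""
    return re.sub(r'[,.!?;:"0-9]', '', text.strip().lower()).split()

def build_index_from_texts(texts):
    fullDict = {}
    for docNumber, text in enumerate(texts, 1):
        for word in clean_text(text):
            if word not in fullDict:
                fullDict[word] = [[docNumber, 1]]
            else:
                # val is the inner list itself, so val[1] += 1 updates fullDict
                for val in fullDict[word]:
                    if val[0] == docNumber:
                        val[1] += 1
                        break
                else:
                    fullDict[word].append([docNumber, 1])
    return fullDict

file1 = "Where should I go\nWhen I want to have\nA smoke,\nA pancake, \nand a nap."
file2 = "I do not know\nWhere my pancake is\nI want to take a nap."
index = build_index_from_texts([file1, file2])
print(index["a"], index["pancake"])
```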