So I'm working on a text mining project. I'm trying to open all the files, pull out the organization and abstract information, split the abstracts into words, and then figure out how many files each word appears in. My question is about that last step: how many files does a given word show up in? To answer it I'm building a dictionary, wordFrequency, to keep count. What I want to tell the dictionary is: if a word is not in the dictionary yet, record the word together with the file number it came from; if the word is already in the dictionary but this file number differs from every file number already recorded for it, append the file number; if both the word and its file number are already in the dictionary, ignore it. Here is my code.
import re

capturedfiles = []
capturedabstracts = []
wordFrequency = {}

wordlist = open('test.txt', 'w')
worddict = open('test3.txt', 'w')

# 'matches' is the list of file paths collected earlier
for filepath in matches[0:5]:
    with open(filepath, 'rt') as mytext:
        mytext = mytext.read()
        #print mytext

        # code to capture the file number
        grabFile = re.findall(r'File\s+\:\s+(\w\d{7})', mytext)
        if len(grabFile) == 0:
            matchFile = "N/A"
        else:
            matchFile = grabFile[0]
        capturedfiles.append(matchFile)

        # code to capture the file abstract
        grabAbs = re.findall(r'Abstract\s\:\s\d{7}\s(\w.+)', mytext)
        if len(grabAbs) == 0:
            matchAbs = "N/A"
        else:
            matchAbs = grabAbs
        capturedabstracts.append(matchAbs)

        # arrange words in format
        lineCount = 0
        wordCount = 0
        lines = matchAbs[0].split('. ')
        for line in lines:
            lineCount += 1
            for word in line.split(' '):
                wordCount += 1
                wordlist.write(matchFile + '|' + str(lineCount) + '|' + str(wordCount) + '|' + word + '\n')
                if word not in wordFrequency:
                    wordFrequency[word] = [matchFile]
                else:
                    if matchFile not in wordFrequency[word]:
                        wordFrequency[word].append(matchFile)
                worddict.write(word + '|' + str(matchFile) + '\n')

wordlist.close()
worddict.close()
What I'm getting right now is every word printed out with the file number it matched, so if a word appears twice in the whole text it gets printed twice, as separate lines. Here is a sample:
mutation| a9500006
is| a9500006
is| a9500007
What I would like it to look like is:
mutation| a9500006
is| a9500006,a9500007
Answer (score 0):
Instead of writing to worddict every time inside the loop, write out the whole wordFrequency dictionary after it has been built, like this:
#assuming wordFrequency is a correctly built dictionary
for key, value in wordFrequency.items():
    #key is a word, value is its list of file numbers
    worddict.write(key + '|')
    for fileno in value:
        #write each file number in the list
        worddict.write(fileno)
        #if it's not the last one, write a comma
        if fileno != value[-1]:
            worddict.write(', ')
    #no more file numbers, end the line
    worddict.write('\n')
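A shorter way to get the same comma-separated line is str.join. This is only a sketch layered on the same wordFrequency dictionary and worddict file handle as above, not a change to the logic:

for key, value in wordFrequency.items():
    # join the file numbers for this word into one comma-separated string
    worddict.write(key + '|' + ', '.join(value) + '\n')

It avoids the last-element check entirely, since join only puts separators between items.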
PS: never, ever mix tabs and spaces! Especially in Python!
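As a side note, the word-to-files mapping itself can also be built with collections.defaultdict(set), which makes the "only record a file number once per word" rule automatic. This is a minimal, self-contained sketch with made-up sample data; samples and wordFiles are illustrative names, not the poster's actual variables:

from collections import defaultdict

# hypothetical (file number, abstract) pairs standing in for the parsed files
samples = [
    ('a9500006', 'mutation is studied here'),
    ('a9500007', 'this is a second abstract'),
]

wordFiles = defaultdict(set)              # word -> set of file numbers
for fileno, abstract in samples:
    for word in abstract.split():
        wordFiles[word].add(fileno)       # a set ignores repeated additions

with open('test3.txt', 'w') as worddict:
    for word, files in wordFiles.items():
        worddict.write(word + '|' + ', '.join(sorted(files)) + '\n')

Sets do not keep insertion order, hence the sorted() call when writing each line out.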