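Total number of occurrences of multiple words/values in a file

Time: 2019-06-26 15:31:11

Tags: python list dictionary

I have a file containing a large amount of text. I am reading this file and want to count how many times each Bible passage is referenced; a reference is marked by a line that starts with "Verse". I then want to print each reference followed by its number of occurrences.

Example file: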

Verse- Matthew 5:2
Commentary- Matthew
Verse- Matthew 10:5
Verse- John 3:16
Commentary- John
Verse- Luke 5:2
Commentary- Luke

The result should look like this:

{'5:2': 2, '10:5': 1, '3:16': 1}

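I am using a dictionary whose keys are the references and whose values are the occurrence counts. The script is short and is given below: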

fileHandle = open("sj", "r")
occurrences = dict()
references = []
#Go through each line if it is a verse line (starts with "Verse"), separate the reference and count the reference
for line in fileHandle:
    if "Verse" in line:
        verseLine = line.split()
        references.append(verseLine[2]) #Reference is always 3rd index
        for reference in references:
            if reference not in occurrences:
                occurrences[reference] = 1
            else:
                occurrences[reference] = occurrences[reference] + 1
print(" References printed below ")
print(references)
print(" Occurrences printed below ")
print(occurrences)

Problem: the references are being counted in a strange way. This is my output:

{'5:2': 5, '10:5': 3, '3:16': 2}

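Clearly that is not right! I think it has something to do with the else: statement. For example, if I change it to occurrences[reference] = occurrences[reference] + 2 (note the 1 changed to 2), I would expect the results to double. But they don't: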

{'5:2': 9, '10:5': 5, '3:16': 3}

Why is this not counting correctly?

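4 Answers:

Answer 0 (score: 2)

The whole references list is processed again for every line that contains the "Verse" string, so the script overcounts.

Move the references loop out of the line loop.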

fileHandle = open("sj", "r")
occurrences = dict()
references = []
#Go through each line if it is a verse line (starts with "Verse"), separate the reference and store it
for line in fileHandle:
    if "Verse" in line:
        verseLine = line.split()
        references.append(verseLine[2]) #Reference is always 3rd index

# After indexing every verse you can start counting them
for reference in references:
    if reference not in occurrences:
        occurrences[reference] = 1
    else:
        occurrences[reference] = occurrences[reference] + 1

print(" References printed below ")
print(references)
print(" Occurrences printed below ")
print(occurrences)

Unless you need the references list for further processing, here is an improved version of the script:

fileHandle = open("sj", "r")
occurrences = dict()

#Go through each line if it is a verse line (starts with "Verse"), separate the reference and count the reference
for line in fileHandle:
    if "Verse" in line:
        verseLine = line.split()
        try:
            occurrences[verseLine[2]] += 1
        except KeyError:
            occurrences[verseLine[2]] = 1

fileHandle.close()
# The references list no longer exists in this version, so only the counts are printed
print(" Occurrences printed below ")
print(occurrences)
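
The same count-as-you-go idea can also be written without try/except by using dict.get with a default of 0. This is only a minimal sketch: it assumes the sample lines from the question are held in a list instead of being read from the file "sj", and the function name count_references is made up for illustration.

```python
def count_references(lines):
    """Count chapter:verse references on lines that start with "Verse"."""
    occurrences = {}
    for line in lines:
        if line.startswith("Verse"):
            parts = line.split()
            if len(parts) >= 3:  # skip malformed verse lines
                reference = parts[2]  # third token, e.g. "5:2"
                # dict.get returns 0 the first time a reference is seen
                occurrences[reference] = occurrences.get(reference, 0) + 1
    return occurrences


# Sample lines copied from the question (assumed to stand in for the file)
sample = [
    "Verse- Matthew 5:2",
    "Commentary- Matthew",
    "Verse- Matthew 10:5",
    "Verse- John 3:16",
    "Commentary- John",
    "Verse- Luke 5:2",
    "Commentary- Luke",
]

result = count_references(sample)
print(result)
```

Each reference is counted exactly once, at the moment its line is read, so no separate references list is needed.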

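Answer 1 (score: 2)

Here are some suggestions for improving your code:

  • Use with open('test.txt') as f, so that you cannot forget to close the file at the end
  • Use collections.Counter to do the counting
  • Do you want to count only the chapter and verse number, or should the book name be included as well?

My code: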

import collections
c = collections.Counter()

with open('test.txt') as f:
    for line in f:
        line = line.strip()
        if len(line) > 0:
            if line.startswith('Verse'):
                data = line[6:].strip()       # Book, chapter and verse number
                # data = line.split()[2]      # only chapter and verse number

                c.update({data: 1})

print('all:')
for k, count in c.items():
    print(' ', count, k)

print('most common:')
for k, count in c.most_common(1):
    print(' ', count, k)

Answer 2 (score: 2)

Another version, using re and collections.Counter:

data = '''Verse- Matthew 5:2
Commentary- Matthew
Verse- Matthew 10:5
Verse- John 3:16
Commentary- John
Verse- Luke 5:2
Commentary- Luke'''

import re
from collections import Counter

c = Counter( re.findall(r'^Verse.*?(\d+:\d+)$', data, flags=re.M) )
print(dict(c))

Prints:

{'5:2': 2, '10:5': 1, '3:16': 1}
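
If the book name should be part of the key, so that Matthew 5:2 and Luke 5:2 are counted separately, the capture group could be widened to cover the whole reference. The following is a sketch under that assumption; the pattern is an illustration and not part of the answer above.

```python
import re
from collections import Counter

# Sample text copied from the question
data = '''Verse- Matthew 5:2
Commentary- Matthew
Verse- Matthew 10:5
Verse- John 3:16
Commentary- John
Verse- Luke 5:2
Commentary- Luke'''

# Capture "Book chapter:verse"; [ \t] keeps the match on a single line
pattern = r'^Verse-[ \t]+(\S.*?[ \t]+\d+:\d+)[ \t]*$'

full_refs = Counter(re.findall(pattern, data, flags=re.M))
print(dict(full_refs))
```

With the book name included, every reference in the sample is distinct, so each count is 1.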

Answer 3 (score: 1)

Here is the fixed code:

fileHandle = open("sj", "r")
occurrences = dict()
# Go through each line if it is a verse line (starts with "Verse"), separate the reference and count the reference
for line in fileHandle:
    if line.startswith("Verse"):
        verseLine = line.split()
        try:
            occurrences[verseLine[2]] += 1  # Reference is always the 3rd token (index 2)
        except KeyError:
            occurrences[verseLine[2]] = 1
fileHandle.close()
print(" Occurrences printed below ")
print(occurrences)

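I think this happens because, for every line containing 'Verse', you increment the occurrence count of all the references collected so far. (Note that I changed "Verse" in line to line.startswith("Verse"), so the code will only execute if the line starts with "Verse".)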
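
To see the over-counting concretely, the question's nested loop can be replayed on the sample lines while recording the totals after each verse line. This is a sketch only: it assumes the sample lines are held in a list, and the name buggy_count is hypothetical.

```python
def buggy_count(lines):
    """Replay the question's nested loop, saving the totals after each verse line."""
    occurrences = {}
    references = []
    snapshots = []
    for line in lines:
        if "Verse" in line:
            references.append(line.split()[2])
            # Bug: this loop re-counts every reference seen so far,
            # not just the one that was appended a moment ago.
            for reference in references:
                occurrences[reference] = occurrences.get(reference, 0) + 1
            snapshots.append(dict(occurrences))
    return snapshots


# Sample lines copied from the question
sample = [
    "Verse- Matthew 5:2",
    "Commentary- Matthew",
    "Verse- Matthew 10:5",
    "Verse- John 3:16",
    "Commentary- John",
    "Verse- Luke 5:2",
    "Commentary- Luke",
]

steps = buggy_count(sample)
for step in steps:
    print(step)
```

The last snapshot is {'5:2': 5, '10:5': 3, '3:16': 2}, the same output reported in the question: '5:2' is incremented once on each of the first three verse lines and twice on the fourth, because by then it appears in the list twice.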