Question

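I have a file containing a large amount of text. I am reading this file and want to count how many times each Bible passage is referenced; references appear on lines that start with "Verse". I then want to print each reference followed by its number of occurrences.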

Sample file:

Verse- Matthew 5:2
Commentary- Matthew
Verse- Matthew 10:5
Verse- John 3:16
Commentary- John
Verse- Luke 5:2
Commentary- Luke

The result should look like this:

{'5:2': 2, '10:5': 1, '3:16': 1}

I am using a dictionary with key: reference and value: occurrences. The script is short and is provided below:

fileHandle = open("sj", "r")
occurrences = dict()
references = []
# Go through each line; if it is a verse line (starts with "Verse"), separate the reference and count it
for line in fileHandle:
    if "Verse" in line:
        verseLine = line.split()
        references.append(verseLine[2]) #Reference is always 3rd index
        for reference in references:
            if reference not in occurrences:
                occurrences[reference] = 1
            else:
                occurrences[reference] = occurrences[reference] + 1
print(" References printed below ")
print(references)
print(" Occurrences printed below ")
print(occurrences)

Problem: the references are being counted in a strange way. Here is my output:

{'5:2': 5, '10:5': 3, '3:16': 2}

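Clearly that is not right! I think it has something to do with the else: statement. For example, if I change it to occurrences[reference] = occurrences[reference] + 2 (note the 1 changed to a 2), I would expect the results to double. But they don't: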

{'5:2': 9, '10:5': 5, '3:16': 3}

Why is this not counting correctly?

Answer 1

The whole references list is being processed again for every "Verse" line, so the script overcounts.
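
A minimal sketch of why the totals come out as 5, 3 and 2. It reruns the question's loop on the sample data, held here in a string (an assumption, standing in for the sj file), and prints the running totals after each verse line:

```python
# Sample data from the question, kept in a string instead of the "sj" file.
sample = """Verse- Matthew 5:2
Commentary- Matthew
Verse- Matthew 10:5
Verse- John 3:16
Commentary- John
Verse- Luke 5:2
Commentary- Luke"""

occurrences = {}
references = []
for line in sample.splitlines():
    if "Verse" in line:
        references.append(line.split()[2])
        # Same structure as the question: every reference collected so far
        # is counted again each time a new verse line is read.
        for reference in references:
            occurrences[reference] = occurrences.get(reference, 0) + 1
        print(references, occurrences)
```

Each earlier reference gains one more count per later verse line, which is also why changing the increment to 2 does not simply double the result.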

Move the references loop out of the line loop.

fileHandle = open("sj", "r")
occurrences = dict()
references = []
# Go through each line; if it is a verse line (starts with "Verse"), separate the reference and store it
for line in fileHandle:
    if "Verse" in line:
        verseLine = line.split()
        references.append(verseLine[2]) #Reference is always 3rd index

# After indexing every verse you can start counting them
for reference in references:
    if reference not in occurrences:
        occurrences[reference] = 1
    else:
        occurrences[reference] = occurrences[reference] + 1

print(" References printed below ")
print(references)
print(" Occurrences printed below ")
print(occurrences)

Unless you need the list of references for further processing, this is an improved version of the script:

fileHandle = open("sj", "r")
occurrences = dict()

# Go through each line; if it is a verse line (contains "Verse"), separate the reference and count it
for line in fileHandle:
    if "Verse" in line:
        verseLine = line.split()
        try:
            occurrences[verseLine[2]] += 1  # the reference is the 3rd field (index 2)
        except KeyError:
            occurrences[verseLine[2]] = 1

fileHandle.close()
# The references list no longer exists in this version, so only the counts are printed
print(" Occurrences printed below ")
print(occurrences)

Answer 2

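Here are some suggestions for improving your code:

Use with open('test.txt') as f so that you cannot forget to close the file at the end.
Use collections.Counter to do the counting.
Do you want to count only the chapter and verse numbers, or should the book name be included as well?

My code: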

import collections
c = collections.Counter()

with open('test.txt') as f:
    for line in f:
        line = line.strip()
        if len(line) > 0:
            if line.startswith('Verse'):
                data = line[6:].strip()       # Book, chapter and verse number (drops "Verse-" and the space after it)
                # data = line.split()[2]      # only chapter and verse number

                c.update({data: 1})

print('all:')
for k, count in c.items():
    print(' ', count, k)

print('most common:')
for k, count in c.most_common(1):
    print(' ', count, k)
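
The third suggestion above matters for the sample data: Matthew 5:2 and Luke 5:2 share a chapter and verse number but are different passages. A small sketch of the difference, assuming the sample lines are held in a list (the names by_verse and by_full are illustrative only):

```python
from collections import Counter

sample_lines = [
    "Verse- Matthew 5:2",
    "Commentary- Matthew",
    "Verse- Matthew 10:5",
    "Verse- John 3:16",
    "Commentary- John",
    "Verse- Luke 5:2",
    "Commentary- Luke",
]

by_verse = Counter()  # keyed by chapter:verse only, as in the question
by_full = Counter()   # keyed by book name plus chapter:verse

for line in sample_lines:
    if line.startswith("Verse"):
        parts = line.split()                  # ['Verse-', 'Matthew', '5:2']
        by_verse[parts[2]] += 1
        by_full[" ".join(parts[1:])] += 1

print(dict(by_verse))
print(dict(by_full))
```

Keyed by chapter and verse alone, '5:2' is counted twice; keyed by the full reference, each of the four passages is counted once.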

Answer 3

Another version using re and collections.Counter:

data = '''Verse- Matthew 5:2
Commentary- Matthew
Verse- Matthew 10:5
Verse- John 3:16
Commentary- John
Verse- Luke 5:2
Commentary- Luke'''

import re
from collections import Counter

c = Counter( re.findall(r'^Verse.*?(\d+:\d+)$', data, flags=re.M) )
print(dict(c))

Prints:

{'5:2': 2, '10:5': 1, '3:16': 1}
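
For readers less familiar with regular expressions, here is the same pattern spelled out with re.VERBOSE so each part can carry a comment. This is only a sketch; the name VERSE_RE and the shortened sample string are not part of the answer above:

```python
import re
from collections import Counter

VERSE_RE = re.compile(
    r"""
    ^Verse      # the line must start with the literal text "Verse"
    .*?         # lazily skip the dash and the book name
    (\d+:\d+)   # capture chapter:verse, e.g. "5:2"
    $           # the reference must be the last thing on the line
    """,
    flags=re.MULTILINE | re.VERBOSE,
)

data = "Verse- Matthew 5:2\nCommentary- Matthew\nVerse- Luke 5:2"

# With a single capture group, findall returns just the captured strings.
counts = Counter(VERSE_RE.findall(data))
print(dict(counts))
```

re.MULTILINE makes ^ and $ match at the start and end of every line rather than only at the ends of the whole string, which is what lets one findall call scan the entire text.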

Answer 4

Here is the fixed code:

fileHandle = open("sj", "r")
occurrences = dict()
# Go through each line; if it is a verse line (starts with "Verse"), separate the reference and count it
for line in fileHandle:
    if line.startswith("Verse"):
        verseLine = line.split()
        try:
            occurrences[verseLine[2]] += 1  # the reference is the 3rd field (index 2)
        except KeyError:
            occurrences[verseLine[2]] = 1
fileHandle.close()
# The references list is no longer filled, so only the counts are printed
print(" Occurrences printed below ")
print(occurrences)

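I think this happened because, for every line containing 'Verse', you incremented the occurrence count of every reference collected so far. (Note that I changed "Verse" in line to line.startswith("Verse"), so the code only executes if the line starts with "Verse".)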
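
A short sketch of why that change can matter. The second line below is hypothetical (it is not in the question's sample file); it shows a commentary line that merely mentions the word:

```python
lines = [
    "Verse- Matthew 5:2",
    "Commentary- This Verse is discussed below",  # hypothetical line
]

# "in" matches the word anywhere in the line; startswith only at the beginning.
matched_with_in = [line for line in lines if "Verse" in line]
matched_with_startswith = [line for line in lines if line.startswith("Verse")]

print(len(matched_with_in))          # both lines match
print(len(matched_with_startswith))  # only the real verse line matches
```

With the in test, the commentary line would be split as well and its third word counted as if it were a reference.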

Total occurrences of multiple words/values in a file
