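I have a file containing a lot of text. I am reading this file and want to print how many times each Bible passage is referenced; a reference is marked by a line that starts with "Verse". For each reference, I want to print the reference followed by its number of occurrences.
Example file: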
Verse- Matthew 5:2
Commentary- Matthew
Verse- Matthew 10:5
Verse- John 3:16
Commentary- John
Verse- Luke 5:2
Commentary- Luke
The result should look like this:
{'5:2': 2, '10:5': 1, '3:16': 1}
I am using a dictionary whose key: value pairs are reference: occurrences. The script is short and is given below:
fileHandle = open("sj", "r")
occurrences = dict()
references = []
# Go through each line; if it is a verse line (starts with "Verse"), separate the reference and count the reference
for line in fileHandle:
    if "Verse" in line:
        verseLine = line.split()
        references.append(verseLine[2])  # Reference is always the 3rd token (index 2)
        for reference in references:
            if reference not in occurrences:
                occurrences[reference] = 1
            else:
                occurrences[reference] = occurrences[reference] + 1
print(" References printed below ")
print(references)
print(" Occurrences printed below ")
print(occurrences)
Problem: the references are counted in a strange way. Here is my output:
{'5:2': 5, '10:5': 3, '3:16': 2}
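Obviously that is not right! I think it has something to do with the else: statement. For example, if I change it to occurrences[reference] = occurrences[reference] + 2 (note the 1 changed to 2), then I would expect the results to double. But they do not: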
{'5:2': 9, '10:5': 5, '3:16': 3}
Why is this not counting correctly?
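Answer 0 (score: 2)
The whole references list is being processed again for every line that contains the "Verse" string, so the script overcounts.
Move the references loop out of the line loop.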
fileHandle = open("sj", "r")
occurrences = dict()
references = []
# Go through each line; if it is a verse line (starts with "Verse"), separate the reference and store it
for line in fileHandle:
    if "Verse" in line:
        verseLine = line.split()
        references.append(verseLine[2])  # Reference is always the 3rd token (index 2)
# After collecting every verse reference you can start counting them
for reference in references:
    if reference not in occurrences:
        occurrences[reference] = 1
    else:
        occurrences[reference] = occurrences[reference] + 1
print(" References printed below ")
print(references)
print(" Occurrences printed below ")
print(occurrences)
Unless you need the list of references for further processing, here is an improved version of the script:
fileHandle = open("sj", "r")
occurrences = dict()
# Go through each line; if it is a verse line (contains "Verse"), separate the reference and count it
for line in fileHandle:
    if "Verse" in line:
        verseLine = line.split()
        try:
            occurrences[verseLine[2]] += 1
        except KeyError:
            # First time this reference is seen
            occurrences[verseLine[2]] = 1
fileHandle.close()
# The references list no longer exists in this version, so only the counts are printed
print(" Occurrences printed below ")
print(occurrences)
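The try/except increment can also be written with dict.get, which returns a default value when the key is missing. This is only a minimal sketch: count_references and sample_lines are illustrative names that do not appear in the original script, and the input is assumed to be the sample lines from the question held in a list rather than read from the "sj" file.

```python
def count_references(lines):
    """Count chapter:verse references on lines that start with "Verse"."""
    occurrences = {}
    for line in lines:
        if line.startswith("Verse"):
            parts = line.split()
            if len(parts) < 3:
                continue  # skip malformed verse lines with no reference
            reference = parts[2]
            # get() returns 0 the first time a reference is seen
            occurrences[reference] = occurrences.get(reference, 0) + 1
    return occurrences


# Hypothetical stand-in for the contents of the file in the question
sample_lines = [
    "Verse- Matthew 5:2",
    "Commentary- Matthew",
    "Verse- Matthew 10:5",
    "Verse- John 3:16",
    "Commentary- John",
    "Verse- Luke 5:2",
    "Commentary- Luke",
]

print(count_references(sample_lines))  # {'5:2': 2, '10:5': 1, '3:16': 1}
```

Passing an open file handle instead of the list should behave the same way, since iterating over a file also yields one line at a time.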
Answer 1 (score: 2)
Here are some suggestions for improving your code: use with open('test.txt') as f so that you cannot forget to close the file at the end, and use collections.Counter to do the counting work. My code:
import collections
c = collections.Counter()
with open('test.txt') as f:
    for line in f:
        line = line.strip()
        if len(line) > 0:
            if line.startswith('Verse'):
                # Book, chapter and verse number; strip() drops the space after "Verse-"
                data = line[6:].strip()
                # data = line.split()[2]  # only chapter and verse number
                c.update({data: 1})
print('all:')
for k, count in c.items():
    print(' ', count, k)
print('most common:')
for k, count in c.most_common(1):
    print(' ', count, k)
Answer 2 (score: 2)
Another version using re and collections.Counter:
data = '''Verse- Matthew 5:2
Commentary- Matthew
Verse- Matthew 10:5
Verse- John 3:16
Commentary- John
Verse- Luke 5:2
Commentary- Luke'''
import re
from collections import Counter
c = Counter( re.findall(r'^Verse.*?(\d+:\d+)$', data, flags=re.M) )
print(dict(c))
Prints:
{'5:2': 2, '10:5': 1, '3:16': 1}
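If the text lives in a file rather than in a string, the same pattern could be applied line by line instead of with re.findall on the whole content. The following is a hedged sketch, not part of the answer: VERSE_RE and count_verses are illustrative names, and io.StringIO stands in for an open file handle.

```python
import io
import re
from collections import Counter

# ^Verse      the line must start with "Verse"
# .*?         skip the book name (non-greedy)
# (\d+:\d+)$  capture chapter:verse at the end of the line
VERSE_RE = re.compile(r'^Verse.*?(\d+:\d+)$')


def count_verses(handle):
    """Count chapter:verse references from an iterable of lines."""
    counts = Counter()
    for line in handle:
        match = VERSE_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical stand-in for open('test.txt')
fake_file = io.StringIO(
    "Verse- Matthew 5:2\nCommentary- Matthew\nVerse- Luke 5:2\n"
)
print(dict(count_verses(fake_file)))  # {'5:2': 2}
```

Reading line by line avoids loading the whole file into memory, at the cost of a slightly longer script.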
Answer 3 (score: 1)
Here is the fixed code:
fileHandle = open("sj", "r")
occurrences = dict()
# Go through each line; if it is a verse line (starts with "Verse"), separate the reference and count it
for line in fileHandle:
    if line.startswith("Verse"):
        verseLine = line.split()
        try:
            occurrences[verseLine[2]] += 1  # Reference is always the 3rd token (index 2)
        except KeyError:
            occurrences[verseLine[2]] = 1
fileHandle.close()
# The references list is never filled in this version, so only the counts are printed
print(" Occurrences printed below ")
print(occurrences)
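I think this happens because, for every line containing 'Verse', you incremented the occurrence count of all the references collected so far. (Note that I changed "Verse" in line to line.startswith("Verse"), so the code will only execute if the line starts with "Verse".)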
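The overcount can be reproduced without a file. In this sketch, verse_refs, nested_count and single_pass_count are illustrative names; verse_refs is assumed to hold the references from the question's sample file in the order they appear.

```python
# References from the sample file, in file order (assumed for illustration)
verse_refs = ["5:2", "10:5", "3:16", "5:2"]


def nested_count(refs):
    """Re-count the whole list after every append, as the question's script does."""
    seen = []
    occurrences = {}
    for ref in refs:
        seen.append(ref)
        for r in seen:  # every earlier reference is counted again
            occurrences[r] = occurrences.get(r, 0) + 1
    return occurrences


def single_pass_count(refs):
    """Count each reference exactly once."""
    occurrences = {}
    for ref in refs:
        occurrences[ref] = occurrences.get(ref, 0) + 1
    return occurrences


print(nested_count(verse_refs))       # {'5:2': 5, '10:5': 3, '3:16': 2}
print(single_pass_count(verse_refs))  # {'5:2': 2, '10:5': 1, '3:16': 1}
```

The first result matches the incorrect output reported in the question, which supports the explanation that the inner loop runs once per verse line over everything collected so far.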