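Python: counting occurrences across multiple lines using a loop

Time: 2015-08-18 09:01:48

Tags: python

I'd like a quick, Pythonic way to get a count using a loop. Honestly, I'm too embarrassed to post my current solution, which doesn't work.

Given a sample from a text file like the following: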

script7 BLANK INTERRUPTION script2 launch4.VBS script3 script8 launch3.VBS script5 launch1.VBS script6

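I want to count every time a script[Y] is followed by a launch[X]. The launch values range from 1 to 5, and the script values range from 1 to 15.

Taking script3 as an example, I need a count for each of the following in a given file: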

script3
launch1
#count this

script3
launch2
#count this

script3
launch3
#count this

script3
launch4
#count this

script3
launch4
#count this

script3
launch5
#count this

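I think the number of loops involved here is beyond my knowledge of Python. Any help is greatly appreciated.

4 answers:

Answer 0 (score: 1)

Here is an approach using nested dictionaries. Let me know if you'd like the output formatted differently: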

#!/usr/bin/env python3

import re
script_dict={}
with open('infile.txt','r') as infile:
    scriptre = re.compile(r"^script\d+$")
    for line in infile:
        line = line.rstrip()
        if scriptre.match(line) is not None:
            script_dict[line] = {}

    infile.seek(0) # go to beginning
    launchre = re.compile(r"^launch\d+\.[vV][bB][sS]$")
    current=None
    for line in infile:
        line = line.rstrip()
        if line in script_dict:
            current=line
        elif launchre.match(line) is not None and current is not None:
            if line not in script_dict[current]:
                script_dict[current][line] = 1 
            else:
                script_dict[current][line] += 1

print(script_dict)
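
For comparison, the same counting can be done in a single pass over the lines. This is only a minimal sketch, not the answerer's code: the function name `count_launches` is made up for illustration, and the sample list assumes the file holds one token per line. It uses `defaultdict(Counter)` so no explicit "first time seen" checks are needed.

```python
import re
from collections import Counter, defaultdict

SCRIPT_RE = re.compile(r"^script\d+$")
LAUNCH_RE = re.compile(r"^launch\d+\.vbs$", re.IGNORECASE)

def count_launches(lines):
    """Count the launch lines seen after each script line, in one pass."""
    counts = defaultdict(Counter)
    current = None
    for raw in lines:
        line = raw.strip()
        if SCRIPT_RE.match(line):
            current = line
            counts[current]  # touch the key so scripts without launches still appear
        elif LAUNCH_RE.match(line) and current is not None:
            counts[current][line] += 1
    return counts

# Assumed layout: the sample from the question, one token per line
sample = ["script7", "BLANK", "INTERRUPTION", "script2", "launch4.VBS",
          "script3", "script8", "launch3.VBS", "script5", "launch1.VBS", "script6"]
result = count_launches(sample)
print({script: dict(counter) for script, counter in result.items()})
```

Like the code above, this keeps the most recent script as "current" until the next script line, so a launch is still counted when unrelated lines sit between it and its script.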

Answer 1 (score: 1)

Here is my solution using a defaultdict of Counters and a regex with lookahead.

import re
from collections import Counter, defaultdict

with open('in.txt', 'r') as f:
    # make sure we have only \n as lineend and no leading or trailing whitespaces
    # this makes the regex less complex
    alltext = '\n'.join(line.strip() for line in f)

# find keyword script\d+ and capture it, then lazy expand and capture everything
# with lookahead so that we stop as soon as and only if next word is 'script' or
# end of the string
scriptPattern = re.compile(r'(script\d+)(.*?)(?=script|\n?$)', re.DOTALL)

# just find everything that matches launch\d+
launchPattern = re.compile(r'launch\d+')

# create a defaultdict with a counter for every entry
scriptDict = defaultdict(Counter)

# go through all matches
for match in scriptPattern.finditer(alltext):
    script, body = match.groups()
    # update the counter of this script
    scriptDict[script].update(launchPattern.findall(body))

# print the results
for script in sorted(scriptDict):
    counter = scriptDict[script]
    if len(counter):
        print('{} launches:'.format(script))
        for launch in sorted(counter):
            count = counter[launch]
            print('\t{} {} time(s)'.format(launch, count))
    else:
        print('{} launches nothing'.format(script))

Using the string on regex101 (see the link above), I get the following result:

script2 launches:
    launch4 1 time(s)
script3 launches nothing
script5 launches:
    launch1 1 time(s)
script6 launches nothing
script7 launches nothing
script8 launches:
    launch3 1 time(s)
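
To see what the lazy group plus lookahead does, here is a small sketch that applies the same pattern to a short string and lists the (script, body) chunks it produces. The sample string is made up for illustration.

```python
import re

# Same pattern as in the answer: capture the script name, then lazily capture
# everything up to the next 'script' or the end of the string
pattern = re.compile(r'(script\d+)(.*?)(?=script|\n?$)', re.DOTALL)

text = "script2\nlaunch4.VBS\nscript3\nscript8\nlaunch3.VBS"

chunks = [(m.group(1), m.group(2).strip()) for m in pattern.finditer(text)]
print(chunks)
# [('script2', 'launch4.VBS'), ('script3', ''), ('script8', 'launch3.VBS')]
```

Because the lookahead consumes nothing, the next match can start exactly at the following `script`. Note that the lookahead stops at any occurrence of the substring `script`, which is fine for this data but could split too early on other input.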

Answer 2 (score: 1)

Why not use a multiline regex? The script then becomes:

import re

# read all the text of the file, and clean it up
with open('counts.txt', 'rt') as f:
    alltext = '\n'.join(line.strip() for line in f)

# find all occurrences of the script line followed by the launch line
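# (?mi) has to lead the pattern (newer Python versions reject inline flags elsewhere);
# \d+ lets script10-script15 match, and $ also matches the last line of the file
cont = re.findall(r'(?mi)^script(\d+)\nlaunch(\d+)\.VBS$', alltext)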

# accumulate the counts of each launch number for each script number
# into nested dictionaries
scriptcounts = {}
for scriptnum,launchnum in cont:
    # if we haven't seen this scriptnumber before, create the dictionary for it
    if scriptnum not in scriptcounts:
        scriptcounts[scriptnum]={}
    # if we haven't seen this launchnumber with this scriptnumber before,
    # initialize count to 0
    if launchnum not in scriptcounts[scriptnum]:
        scriptcounts[scriptnum][launchnum] = 0
    # increment the count for this combination of script and launch number
    scriptcounts[scriptnum][launchnum] += 1

# produce the output in order of increasing scriptnum/launchnum
for scriptnum in sorted(scriptcounts, key=int):
    for launchnum in sorted(scriptcounts[scriptnum], key=int):
        print("script%s\nlaunch%s.VBS\n# count %d\n" % (scriptnum, launchnum, scriptcounts[scriptnum][launchnum]))

The output (in the format you requested) is, for example:

script2
launch1.VBS
# count 1

script2
launch4.VBS
# count 1

script5
launch1.VBS
# count 1

script8
launch3.VBS
# count 3

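re.findall() returns a list of all matches; each match is a tuple of the () groups in the pattern. The exception is (?mi), which is not a group but a directive telling the regex matcher to treat ^ and $ as line boundaries (m) and to match case-insensitively (i). As the pattern stands, the parenthesized digit fragments pull only the numbers after script/launch into each match. It could just as easily capture the full names by using '(script\d+)' and, similarly, '(launch\d+\.VBS)'; only the printing would need to change to handle that variation.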

HTH, Barney
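
The variation mentioned in the last paragraph (capturing the full names instead of just the numbers) might look like the sketch below. It swaps the nested dictionaries for `collections.Counter`, which is a substitution made here for brevity and not part of the answer. The sample text is made up for illustration.

```python
import re
from collections import Counter

text = ("script2\nlaunch4.VBS\nscript3\nscript8\nlaunch3.VBS\n"
        "script8\nlaunch3.vbs")

# (?mi): ^ and $ match at every line boundary, and matching ignores case
pairs = re.findall(r'(?mi)^(script\d+)\n(launch\d+\.VBS)$', text)

# Lower-case the launch name so launch3.VBS and launch3.vbs share one count
pair_counts = Counter((script, launch.lower()) for script, launch in pairs)

for (script, launch), count in sorted(pair_counts.items()):
    print("%s\n%s\n# count %d\n" % (script, launch, count))
```

Only a launch line that comes directly after a script line is counted here, so `script3` (followed by another script) contributes nothing.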

Answer 3 (score: 0)

You can use the setdefault method.

Code:

dic={}
with open("a.txt") as inp:
    check=0
    key_string=""
    for line in inp:
        if check:
            if line.strip().startswith("launch") and int(line.strip()[6])<6:
                print("yes")  # debug output; print() works on Python 2 and 3
                dic[key_string]=dic.setdefault(key_string,0)+1
            check=0
        if line.strip().startswith("script"):
            key_string=line.strip()
            check=1

For your given input, the output will be

Output:

{"script3":6}
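
The code above keeps one total per script. If a separate count per launch is wanted, as in the question, the same `setdefault` idea can be nested one level deeper. This is a rough sketch under stated assumptions: the function name `count_following` and the sample lines are hypothetical, and, like the answer, it only counts a launch that comes directly after a script line.

```python
def count_following(lines):
    """Count launch lines that directly follow a script line, per script."""
    counts = {}
    previous = ""
    for raw in lines:
        line = raw.strip()
        if previous.startswith("script") and line.startswith("launch"):
            # setdefault returns the existing inner dict or installs a new one
            per_script = counts.setdefault(previous, {})
            per_script[line] = per_script.setdefault(line, 0) + 1
        previous = line
    return counts

# Hypothetical sample lines, one token per line
lines = ["script3", "launch1.VBS", "script3", "launch4.VBS",
         "script3", "launch4.VBS", "script3", "script8"]
pair_totals = count_following(lines)
print(pair_totals)
# {'script3': {'launch1.VBS': 1, 'launch4.VBS': 2}}
```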