How do I pass multiple Python variables into a regular expression to find lines in a text file?

Time: 2014-06-03 09:59:36

Tags: python regex

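How can I use a regular expression to find the set of lines in a log file that start with 5 and match two timestamp values, a start time and an end time, entered by the user in the format shown here (e.g. 14/05/02 02:30:00)?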

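I need a script that searches every line of my log file for 3 parameters:

1) Start time (entered by the user), e.g. 14/05/02 02:30:00
2) End time (entered by the user), e.g. 14/05/02 02:45:00
3) The line starts with the number "5"

Sample lines from my log file: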

9,14/05/02 02:30:00,1,1,94767539135,94767539135,0,1,172839,0,1,172839,,14/05/02 02:30:00,9477000003,,,,,93,14/05/02 03:30:00,0,0,9477000008,,false,,,,,,,,false,0,5011405020230005756,67000,
5,14/05/02 02:30:00,1,1,94776082043,94776082043,0,1,77100,0,1,77100,,14/05/02 02:30:00,9477000003,,,,,19,14/05/05 02:30:00,0,0,9477000007,9477000003,false,,,,,,,,true,,,0,,5011405020230005752,
11,14/05/02 02:30:00,94776082043,1,9477000051,,,5011405020230005752,
12,14/05/02 02:30:00,true,false,9477000008,413025705057121,,,,5011405020230005748,
3,14/05/02 02:30:00,1,1,94713784377,0,1,1,94771653521,0,1,0713784377,,14/05/02 02:29:48,9477000003,413021500734521,,,,0,14/05/05 02:29:50,,,9477000006,9477000006,,,,,,,,,,,,,0,5011405020229484460,
9,14/05/02 02:30:00,1,1,94771969046,94771969046,0,1,776236,0,1,776236,,14/05/02 02:30:00,9477000003,,,,,62,14/05/05 02:30:00,0,0,9477000008,,false,,,,,,,,false,0,5011405020230005763,67000,
5,14/05/02 02:30:00,1,1,94771059909,94771059909,1,1,94776716217,1,1,94776716217,,14/05/02 02:29:57,9477000003,413020776716217,,,,54,14/05/05 02:29:55,0,0,9477000006,9477000047,false,,,,,,,,false,,,0,,5011405020229575408,

Here is part of the code I tried:

#!/usr/bin/env python

import re
from datetime import datetime

count = 0

fh = open(r"/home/harzyne/pythonscripts/read_log_file.txt")

yyyy, mo, dd, hh, mm = raw_input("Enter Start_Time in format(yyyy,mm,dd,hh,mm)").split(',')
yyyy1, mo1, dd1, hh1, mm1 = raw_input("Enter End_Time in format(yyyy,mm,dd,hh,mm)").split(',')

# Counts every line that starts with 5; the time range is not applied yet
for i in fh:
    if re.search('^5', i):
        count += 1
print count

try:
    #start_t = datetime(2014,5,2,02,30)
    #end_t = datetime(2014,5,2,02,45)
    start_t = datetime(int(yyyy), int(mo), int(dd), int(hh), int(mm))
    end_t = datetime(int(yyyy1), int(mo1), int(dd1), int(hh1), int(mm1))
    diff = end_t - start_t
except ValueError:
    print ("invalid argument")
    raise  # diff is undefined here, so do not continue
    #start = raw_input("Enter Start_Time in format(yyyy,mm,dd,hh,mm) ")
    #end = raw_input("Enter End_Time in format(yyyy,mm,dd,hh,mm)")

# total_seconds() also counts whole days, unlike diff.seconds
no_of_msg_per_sec = float(count) / diff.total_seconds()
print no_of_msg_per_sec

2 Answers:

Answer 0: (score: 1)

Here is an example of how to build the search pattern and count the lines:

#!/usr/bin/python

import re

s = '''9,14/05/02 02:30:00,1,1,94767539135,94767539135,0,1,172839,0,1,172839...
5,14/05/02 02:30:00,1,1,94776082043,94776082043,0,1,77100,0,1,77100,,14/05/0...
11,14/05/02 02:30:00,94776082043,1,9477000051,,,5011405020230005752,
12,14/05/02 02:30:00,true,false,9477000008,413025705057121,,,,50114050202300...
3,14/05/02 02:30:00,1,1,94713784377,0,1,1,94771653521,0,1,0713784377,,14/05/...
9,14/05/02 02:30:00,1,1,94771969046,94771969046,0,1,776236,0,1,776236,,14/05...
5,14/05/02 02:29:59,1,1,94771059909,94771059909,1,1,94776716217,1,1,94776...'''

start_sb = r'14/05/02 02:29:59'
end_sb = r'14/05/02 02:30:00'

p = re.compile(r'^5,' + end_sb + r',.*\n([\s\S]*?)^5,' + start_sb + r',', re.M)

m = p.search(s)

if m:
    print m.group(1).count("\n")
else:
    print 'no result'

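The idea is to put everything between the start and end limits into a capture group, and then count the number of newlines in that group.

About the pattern itself:

  • .* matches every character up to the end of the line
  • [\s\S] is a well-known trick for matching any character, including newlines
  • ([\s\S]*?) is capture group 1; it uses a lazy quantifier to grab everything up to the first line that starts with 5 followed by the start date and time

The re.M (MULTILINE) option changes the meaning of the ^ anchor from start of string to start of line.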
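The points above can be checked with a small sketch. It is written for Python 3 (the answer's code is Python 2), and the three-line sample plus the names `text`, `first`, `last` and `between` are made up for illustration. `re.escape` is not in the original answer; it is added here as a precaution so that user-supplied values are treated as literal text.

```python
import re

# Hypothetical three-line sample, not taken from the log above
text = "5,A,x\n9,B,y\n5,C,z"

# Without re.M, ^ only matches at the very start of the string
print(re.findall(r'^5,', text))
# With re.M, ^ matches at the start of every line
print(re.findall(r'^5,', text, re.M))

# . stops at a newline, so this finds nothing
print(re.search(r'A.*C', text))
# [\s\S] also matches newlines, so this spans several lines
print(re.search(r'A[\s\S]*?C', text).group(0))

# Build the pattern from variables; re.escape keeps them literal
first, last = 'A', 'C'
pattern = re.compile(
    r'^5,' + re.escape(first) + r',.*\n([\s\S]*?)^5,' + re.escape(last) + r',',
    re.M)

# Group 1 holds the lines between the two boundary lines
between = pattern.search(text).group(1)
print(between.count("\n"))
```

Note that the newline count only covers the lines between the two boundary lines, not the boundary lines themselves.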

Answer 1: (score: 0)

import re

text = '''9,14/05/02 02:30:00,1,1,94767539135,94767539135,0,1,172839,0,1,172839...
5,14/05/02 02:30:00,1,1,94776082043,94776082043,0,1,77100,0,1,77100,,14/05/0...
11,14/05/02 02:30:00,94776082043,1,9477000051,,,5011405020230005752,
12,14/05/02 02:30:00,true,false,9477000008,413025705057121,,,,50114050202300...
3,14/05/02 02:30:00,1,1,94713784377,0,1,1,94771653521,0,1,0713784377,,14/05/...
9,14/05/02 02:30:00,1,1,94771969046,94771969046,0,1,776236,0,1,776236,,14/05...
5,14/05/02 02:29:59,1,1,94771059909,94771059909,1,1,94776716217,1,1,94776...'''

start = r'14/05/02 02:29:59'
end = r'14/05/02 02:30:00'

regex = r'(^5.*(?:' + start + '|' + end + ').*$)'

matches = re.findall(regex, text, re.M)

print matches

This will match any line that:

  • starts with 5
  • contains start OR end

So the count will be len(matches).

Output:

['5,14/05/02 02:30:00,1,1,94776082043,94776082043,0,1,77100,0,1,77100,,14/05/0...',
'5,14/05/02 02:29:59,1,1,94771059909,94771059909,1,1,94776716217,1,1,94776...']
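Both answers match lines carrying exactly the start or end timestamp. If the goal is instead to count lines starting with 5 whose timestamp falls anywhere between the two values, one possible approach is to let the regular expression extract the timestamp and compare it as a `datetime`. The following is a minimal Python 3 sketch under stated assumptions: the timestamp is the second comma-separated field, its format is `yy/mm/dd HH:MM:SS`, and the names `count_lines_in_range`, `TS_FORMAT`, `LINE_RE` and the shortened `sample` lines are hypothetical.

```python
import re
from datetime import datetime

TS_FORMAT = '%y/%m/%d %H:%M:%S'   # assumed format, e.g. 14/05/02 02:30:00
# Line starts with "5," followed by the timestamp in the second field
LINE_RE = re.compile(r'^5,(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}),')

def count_lines_in_range(lines, start, end):
    """Count lines starting with 5 whose timestamp is within [start, end]."""
    start_t = datetime.strptime(start, TS_FORMAT)
    end_t = datetime.strptime(end, TS_FORMAT)
    count = 0
    for line in lines:
        m = LINE_RE.match(line)
        if m and start_t <= datetime.strptime(m.group(1), TS_FORMAT) <= end_t:
            count += 1
    return count

# Shortened, made-up lines in the same shape as the log
sample = [
    '9,14/05/02 02:30:00,1,1',
    '5,14/05/02 02:30:00,1,1',
    '5,14/05/02 02:44:10,1,1',
    '5,14/05/02 02:50:00,1,1',
]
result = count_lines_in_range(sample, '14/05/02 02:30:00', '14/05/02 02:45:00')
print(result)
```

Because the function only iterates over `lines`, an open file object could be passed in place of the list.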