Python: How to count by date without the timestamp

Time: 2017-09-12 18:01:39

Tags: python python-2.7 collections sys

This is my data format:

    [Mon May 02 15:38:50 2016] [error] [client XX.XX.XX.XX] File does not exist: /home/XXX/XXXX/XXX/XXX/XXX.shtml

Here is my code; I am trying to display the number of lines per date:

    # datecount.py
    import sys, collections

    # sys.argv is the list of command-line arguments
    # sys.argv[0] is the name of the program itself
    # sys.argv[1] is optional and will be the file name

    # set input based on number of arguments
    if len(sys.argv) == 1:
        f = sys.stdin
    elif len(sys.argv) == 2:
        try:
            f = open(sys.argv[1])
        except IOError:
            print "Cannot open", sys.argv[1]
            sys.exit()
    else:
        print "USAGE: python datecount [FILE]"
        sys.exit()

    dateCounts = collections.Counter()
    # for every line passed into the script
    for line in f:
        # find indices of date section
        start = line.find("[")
        if start >= 0 :
            end = line.find("]", start)
            # grab just the date
            date = line[start+21: end] #by YEAR
            dateCounts[date]=dateCounts[date]+1

    #print top dates
    for date in dateCounts.most_common():
        sys.stdout.write(str(date) + "\n")

The current output is:

    ('2017', 738057)
    ('2016', 446204)
    ('2015', 9995)
    ('2014', 706)

But I want to count by date only, for example:

    ('May 02 2016', 128)
    ('May 03 2016', 105)
    ('May 04 2016', 99)

I was thinking of using a regular expression, but I don't know how.

How can I get rid of the timestamp in the middle of the date?

2 Answers:

Answer 0: (score: 0)

The following code produces the expected result. I hope this helps.

    # datecount.py
    import sys, collections

    # sys.argv is the list of command-line arguments
    # sys.argv[0] is the name of the program itself
    # sys.argv[1] is optional and will be the file name

    # set input based on number of arguments
    if len(sys.argv) == 1:
        f = sys.stdin
    elif len(sys.argv) == 2:
        try:
            f = open(sys.argv[1])
        except IOError:
            print "Cannot open", sys.argv[1]
            sys.exit()
    else:
        print "USAGE: python datecount [FILE]"
        sys.exit()

    dateCounts = collections.Counter()
    # for every line passed into the script
    for line in f:
        # find indices of date section
        start = line.find("[")
        if start >= 0 :
            end = line.find("]", start)
            # grab the month and day, skip the time, then append the year
            # (both slice bounds are relative to start, in case "[" is not at index 0)
            date = line[start+5:start+11] + ' ' + line[start+21:end] #by Date and YEAR
            dateCounts[date]=dateCounts[date]+1

    #print top dates
    for date in dateCounts.most_common():
        sys.stdout.write(str(date) + "\n")
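
The fixed offsets in this answer only work if the weekday, month, day, and time fields always have the same width. A minimal sketch of an alternative, assuming the bracketed timestamp always holds the five whitespace-separated fields shown in the question (weekday, month, day, time, year), splits the bracketed text instead. The helper name `date_key` and the sample lines are illustrative and not part of the original script:

```python
import collections


def date_key(line):
    """Return a 'month day year' key such as 'May 02 2016', or None if no timestamp is found."""
    start = line.find("[")
    if start < 0:
        return None
    end = line.find("]", start)
    if end < 0:
        return None
    # Assumed fields inside the brackets: weekday, month, day, time, year
    fields = line[start + 1:end].split()
    if len(fields) != 5:
        return None
    # Keep month, day, and year; drop the weekday and the time
    return " ".join([fields[1], fields[2], fields[4]])


# Sample lines in the question's format (illustrative data only)
sample_lines = [
    "[Mon May 02 15:38:50 2016] [error] [client XX.XX.XX.XX] File does not exist: /home/XXX/XXXX/XXX/XXX/XXX.shtml",
    "[Mon May 03 15:38:50 2017] [error] [client XX.XX.XX.XX] File does not exist: /home/XXX/XXXX/XXX/XXX/XXX.shtml",
    "[Mon May 02 15:38:50 2016] [error] [client XX.XX.XX.XX] File does not exist: /home/XXX/XXXX/XXX/XXX/XXX.shtml",
    "a line without a timestamp",
]

date_counts = collections.Counter()
for line in sample_lines:
    key = date_key(line)
    if key is not None:
        date_counts[key] += 1

for item in date_counts.most_common():
    print(item)
```

In the real script, the loop over `sample_lines` would be replaced by the loop over `f`; lines that do not match the assumed layout are skipped rather than counted under a malformed key.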

Answer 1: (score: 0)

An implementation using a regexp:

    import sys
    import collections
    import re

    dateCounts = collections.Counter()
    input_str = """
    [Mon May 02 15:38:50 2016] [error] [client XX.XX.XX.XX] File does not exist: /home/XXX/XXXX/XXX/XXX/XXX.shtml
    [Mon May 03 15:38:50 2017] [error] [client XX.XX.XX.XX] File does not exist: /home/XXX/XXXX/XXX/XXX/XXX.shtml
    [Mon May 02 15:38:50 2016] [error] [client XX.XX.XX.XX] File does not exist: /home/XXX/XXXX/XXX/XXX/XXX.shtml
    """

    found = re.findall(r"\[(.*)\].*\[.*\].*\[.*\].*", input_str, re.MULTILINE)

    for date in found:
        dateCounts[date] = dateCounts[date] + 1

    for date in dateCounts.most_common():
        sys.stdout.write(str(date) + "\n")

Output:

    ('Mon May 02 15:38:50 2016', 2)
    ('Mon May 03 15:38:50 2017', 1)
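
Note that the pattern in this answer captures everything inside the first pair of brackets, so the time of day is still part of each key and lines from the same day with different times would be counted separately. A sketch of a pattern that captures only the month, day, and year, assuming the timestamp layout shown in the question; the names `DATE_RE` and `count_dates` are illustrative:

```python
import collections
import re

# Inside the leading bracket: weekday, (month), (day), HH:MM:SS, (year)
DATE_RE = re.compile(r"^\[\w{3}\s+(\w{3})\s+(\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+(\d{4})\]")


def count_dates(lines):
    """Count lines per 'month day year' key, ignoring the weekday and the time."""
    counts = collections.Counter()
    for line in lines:
        match = DATE_RE.match(line)
        if match:
            # Join the three captured groups, e.g. 'May 02 2016'
            counts[" ".join(match.groups())] += 1
    return counts


input_str = """
[Mon May 02 15:38:50 2016] [error] [client XX.XX.XX.XX] File does not exist: /home/XXX/XXXX/XXX/XXX/XXX.shtml
[Mon May 03 15:38:50 2017] [error] [client XX.XX.XX.XX] File does not exist: /home/XXX/XXXX/XXX/XXX/XXX.shtml
[Mon May 02 15:38:50 2016] [error] [client XX.XX.XX.XX] File does not exist: /home/XXX/XXXX/XXX/XXX/XXX.shtml
"""

date_counts = count_dates(input_str.splitlines())
for item in date_counts.most_common():
    print(item)
# ('May 02 2016', 2)
# ('May 03 2017', 1)
```

Because `count_dates` accepts any iterable of lines, it could also be given an open file or `sys.stdin` directly instead of a string.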