I have an input file in the format shown below. This is only a sample; the real file contains many entries laid out the same way:
0.0 aa:bb:cc dd:ee:ff 100 000 ---------->line1
0.2 aa:bb:cc dd:ee:ff 101 011 ---------->line2
0.5 dd:ee:ff aa:bb:cc 230 001 ---------->line3
0.9 dd:ee:ff aa:bb:cc 231 110 ---------->line4
1.2 dd:ee:ff aa:bb:cc 232 101 ---------->line5
1.4 aa:bb:cc dd:ee:ff 102 1111 ---------->line6
1.6 aa:bb:cc dd:ee:ff 103 1101 ---------->line7
1.7 aa:bb:cc dd:ee:ff 108 1001 ---------->line8
2.4 dd:ee:ff aa:bb:cc 233 1000 ---------->line9
2.8 gg:hh:ii jj:kk:ll 450 1110 ---------->line10
3.2 jj:kk:ll gg:hh:ii 600 010 ---------->line11
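The first column is the timestamp, the second the source address, the third the destination address, and the fourth the sequence number; the fifth is not needed.
For this problem, a group is defined as follows:
i. The lines should be consecutive (e.g. lines 1 and 2).
ii. They should have the same second and third columns, and the fourth column should differ by 1 from one line to the next.
I need to compute the timestamp difference between the first line of a group and the first line of the next group, over all groups that belong to the same (column2, column3) pair.
For example, the groups corresponding to (aa:bb:cc dd:ee:ff) are (line1, line2) & (line6, line7) & (line8). The final output should be (aa:bb:cc dd:ee:ff) = [1.4 0.3].
That is because 1.4 is the timestamp difference between line6 and line1, and 0.3 is the timestamp difference between line8 and line6 (1.7 - 1.4) for the (aa:bb:cc dd:ee:ff) entries.
This should be computed for every (column2, column3) pair.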
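To make the expected numbers concrete, here is a minimal sketch that only checks the arithmetic. The first-line timestamps of the three (aa:bb:cc dd:ee:ff) groups are read off by hand from the sample above; the variable names and the rounding to six decimals are illustrative assumptions, not part of the required solution.

```python
# First-line timestamps of the (aa:bb:cc, dd:ee:ff) groups in the sample,
# read off by hand: line1 (0.0), line6 (1.4) and line8 (1.7).
group_starts = [0.0, 1.4, 1.7]

# Difference between each group's first line and the next group's first line.
# round() only hides floating-point noise such as 0.30000000000000004.
diffs = [round(b - a, 6) for a, b in zip(group_starts, group_starts[1:])]
print(diffs)  # [1.4, 0.3]
```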
I wrote a program that counts the number of members in each group, as follows:
#!/usr/bin/python
# Assumes each line has exactly the five columns described above.
with open("luawrite") as f:
    # read the first line and use its sequence number as the initial `prev`
    prev = int(next(f).rsplit(None, 2)[-2])  # `str.rsplit` keeps splits to a minimum
    count = 1  # initialize `count` to 1
    for lin in f:
        num = int(lin.rsplit(None, 2)[-2])
        if num - prev == 1:  # current `num` continues the group
            count += 1  # increment `count`
        else:
            print(count)  # group ended: print `count` or write it to a file
            count = 1  # reset `count` to 1
        prev = num  # set `prev` = `num`
    print(count)  # count of the last group
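I tried various approaches that use columns 2 and 3 as a dictionary key, but there are multiple groups for the same key. This looks like a daunting task to me. Please help me solve this tricky problem.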
Answer 0: (score: 2)
from collections import defaultdict

groups = defaultdict(list)
i = 1  # 1-based line number, appended to each row
with open('input') as f:
    for line in f:
        row = line.strip().split() + [i]
        gname = " ".join(row[1:3])  # "source destination" pair
        groups[gname] += [row]
        i += 1

output = defaultdict(list)
for gname, group in groups.items():
    gr = []  # rows of the group currently being built
    last_key, last_col4, last_idx = '', -1, -1
    for row in group:
        key, idx = " ".join(row[1:3]), int(row[-1])
        keys_same = last_key == key and last_col4 + 1 == int(row[3])
        consecutive = last_idx + 1 == idx
        if not (gr and keys_same and consecutive):
            # a new group starts: record the gap between its first line
            # and the first line of the previous group, keyed by the pair
            if gr:
                output[gname] += [float(row[0]) - float(gr[0][0])]
            gr = [row]
        else:
            gr += [row]
        last_key, last_col4, last_idx = key, int(row[3]), idx

for k, v in output.items():
    print(k, ' --> ', v)
Answer 1: (score: 1)
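itertools.groupby() can be used to extract the groups defined by:
i. The lines should be consecutive (e.g. lines 1 and 2).
ii. They should have the same second and third columns, and the fourth column should differ by 1.
Then collections.defaultdict() can be used to collect the timestamps in order to find the differences:
I need to compute the timestamp difference between the first line of a group and the first line of the next group, over all groups that belong to the same (column2, column3) pair.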
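What makes groupby() suitable here is the choice of key: inside a run of consecutive lines whose sequence numbers increase by 1, the line index minus the sequence number stays constant, and it changes when the run is broken. The following is a minimal sketch of that idea alone, using the sequence numbers (column 4) of the first eight sample lines; the names `seqs` and `runs` are made up for illustration, and the addresses are deliberately left out.

```python
from itertools import groupby

# Sequence numbers (column 4) of the first eight sample lines.
seqs = [100, 101, 230, 231, 232, 102, 103, 108]

# (index - sequence number) is constant inside a run of +1 steps,
# so groupby() starts a new group whenever that value changes.
runs = [[seq for _, seq in grp]
        for _, grp in groupby(enumerate(seqs), key=lambda p: p[0] - p[1])]
print(runs)  # [[100, 101], [230, 231, 232], [102, 103], [108]]
```

In the code below, the (source, destination) pair is part of the key as well, so a run also ends when the addresses change.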
from collections import defaultdict
from itertools import groupby
import sys

file = sys.stdin  # could be anything that yields lines e.g., a regular file
rows = (line.split() for line in file if line.strip())

# get timestamps map: (source, destination) -> timestamps of 1st lines
timestamps = defaultdict(list)
# rows stay in one group while the addresses match and
# (line index - sequence number) stays constant
for ((source, dest), _), group in groupby(
        enumerate(rows),
        key=lambda pair: (pair[1][1:3], pair[0] - int(pair[1][3]))):
    ts = float(next(group)[1][0])  # a timestamp from the 1st line in a group
    timestamps[source, dest].append(ts)

# find differences
for (source, dest), t in sorted(timestamps.items()):
    diffs = [b - a for a, b in zip(t, t[1:])]  # pairwise differences
    info = ", ".join(map(str, diffs)) if diffs else t  # support unique
    print("{source} {dest}: {info}".format(**vars()))
Output:
aa:bb:cc dd:ee:ff: 1.4, 0.3
dd:ee:ff aa:bb:cc: 1.9
gg:hh:ii jj:kk:ll: [2.8]
jj:kk:ll gg:hh:ii: [3.2]
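[] in the output means that the input contains only a single group for the corresponding (source address, destination address) pair, i.e., there is nothing to compute a difference from. You could prepend a dummy 0.0 timestamp to the timestamps lists to handle all cases uniformly.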
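One way the dummy-timestamp idea might look is sketched below. The `timestamps` dictionary is filled in by hand with the first-line timestamps of two pairs from the sample, and `diffs_with_dummy` is a hypothetical helper name; the rounding is an added assumption to keep floating-point noise out of the printout.

```python
# Hand-filled first-line timestamps per (source, destination) pair,
# matching what the sample input produces for these two pairs.
timestamps = {
    ("aa:bb:cc", "dd:ee:ff"): [0.0, 1.4, 1.7],
    ("gg:hh:ii", "jj:kk:ll"): [2.8],
}

def diffs_with_dummy(ts):
    """Pairwise differences after prepending a dummy 0.0 timestamp."""
    padded = [0.0] + ts
    return [round(b - a, 6) for a, b in zip(padded, padded[1:])]

result = {pair: diffs_with_dummy(ts) for pair, ts in timestamps.items()}
for pair, diffs in sorted(result.items()):
    print(pair, diffs)
```

With the dummy in place every pair yields a non-empty list, but note that the first entry is then the offset of the first group from 0.0 rather than a difference between two groups.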