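I want to compare multiple rows in a list that holds source IP, destination IP, packet time and size. I want to merge the data across all rows that share the same source IP and destination IP. For example, if 2 or more rows have the same source and destination IP, how do I combine all of their data? I don't want to compare only the first and second rows; I want to match every row in the list that has the same 172.217.2.161 (source) and 10.247.15.39 (destination), then extract the first timestamp and the last timestamp into a new list.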
def combine_data(source, dest, time, length):
    CombinePacket = [(source[i], dest[i], time[i], length[i]) for i in range(len(source))]
    NewData = []
    TotalSize = 0
    for i, j in zip(CombinePacket, CombinePacket[1:]):
        if(i[0:2] == j[0:2]):
            TotalSize = TotalSize + int(i[3])+int(j[3])
            data = i[0], i[1], i[2], j[2], TotalSize
            NewData.append(data)
The list contains:
[(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044180', 46)]
[(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044190', 29)]
[(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044200', 50)]
The output should be:
[['172.217.2.161'], ['10.247.15.39'],'13:25:31.044180', '13:25:31.044200', 125]
Answer 0 (score: 1)
You can use itertools.groupby for this kind of task:
from __future__ import print_function
import itertools

def key(packet):
    return packet[0], packet[1]  # source and destination

def do_combine_data(sources, destinations, times, lengths):
    packets = zip(sources, destinations, times, lengths)
    # groupby only merges consecutive items, so sort by the same key first
    for (packet_source, packet_dest), group in itertools.groupby(
            sorted(packets, key=key), key=key):
        group = list(group)
        packet_sizes = [packet_size for (_, _, _, packet_size) in group]
        packet_times = [at for (_, _, at, _) in group]
        start_time, end_time = [func(packet_times) for func in (min, max)]
        total_size = sum(packet_sizes)
        yield packet_source, packet_dest, start_time, end_time, total_size
After that you can use it however you need (even wrapping source and destination in their own lists):
def combine_data(source, dest, time, length):
    return [
        ([[s], [d], b, e, t])
        for s, d, b, e, t in do_combine_data(source, dest, time, length)]

def main():
    sources = ["a", "a", "a", "a", "a"]
    destinations = ["b", "b", "b", "c", "c"]
    times = ["1", "2", "5", "3", "4"]
    lengths = [12, 11, 51, 89, 17]
    print(combine_data(sources, destinations, times, lengths))

if __name__ == '__main__':
    main()
The output will be:
[[['a'], ['b'], '1', '5', 74], [['a'], ['c'], '3', '4', 106]]
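A side note on why the code above sorts before grouping: itertools.groupby only merges consecutive items that share a key, so an unsorted packet list can yield the same flow more than once. Here is a minimal sketch of the difference; the pairs and the helper name group_sizes are made up for illustration.

```python
import itertools

# Hypothetical (source, destination) pairs; the "a" -> "b" flow is interrupted
# by a packet that belongs to a different flow.
pairs = [("a", "b"), ("a", "c"), ("a", "b")]

def group_sizes(items):
    # Count how many items fall into each consecutive run of equal values.
    return [(key, len(list(group))) for key, group in itertools.groupby(items)]

unsorted_groups = group_sizes(pairs)
sorted_groups = group_sizes(sorted(pairs))

print(unsorted_groups)  # the ("a", "b") flow shows up as two separate runs
print(sorted_groups)    # after sorting, each flow appears exactly once
```

Sorting costs O(n log n), which is usually fine for a packet list held in memory.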
Answer 1 (score: 0)
Keep a dictionary and update its values as you go, then convert it to a list. Say you have a list like this:
data = [[(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044180', 46)],
        [(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044190', 29)],
        [(['172.217.2.161'], ['10.247.15.39'], '13:25:31.044200', 50)]]
Then:
d = dict()
for dat in data:
    sourceIp = dat[0][0][0]
    destIp = dat[0][1][0]
    # Each record holds one timestamp and one size, so the timestamp
    # starts out as both the earliest and the latest value for its flow.
    minTs = maxTs = dat[0][2]
    count = dat[0][3]
    k = (sourceIp, destIp)
    if k not in d:
        d[k] = (minTs, maxTs, count)
    else:
        val = d[k]
        d[k] = (min(minTs, val[0]), max(maxTs, val[1]), count + val[2])
output = [[[k[0]], [k[1]], v[0], v[1], v[2]] for (k, v) in d.items()]
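Of course, you can build this dictionary directly instead of building the list first, which avoids the intermediate list. Also, unless you really need them, I'd recommend not wrapping the IPs in singleton lists, since that only leads to confusing indexing.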
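To illustrate that last suggestion, here is a rough sketch that builds the dictionary straight from the four parallel lists and keeps the IPs as plain strings. The function name combine_flows is made up for this example, and it assumes the timestamps are fixed-width HH:MM:SS.ffffff strings from a single day, so comparing them as text matches chronological order.

```python
def combine_flows(sources, destinations, times, lengths):
    """Aggregate packets per (source, destination) pair in a single pass."""
    flows = {}
    for src, dst, ts, size in zip(sources, destinations, times, lengths):
        key = (src, dst)
        if key not in flows:
            # First packet of this flow: it is both the earliest and the latest.
            flows[key] = (ts, ts, size)
        else:
            first, last, total = flows[key]
            # Assumption: fixed-width timestamp strings compare chronologically.
            flows[key] = (min(first, ts), max(last, ts), total + size)
    return [[src, dst, first, last, total]
            for (src, dst), (first, last, total) in flows.items()]

sources = ["172.217.2.161"] * 3
destinations = ["10.247.15.39"] * 3
times = ["13:25:31.044180", "13:25:31.044190", "13:25:31.044200"]
lengths = [46, 29, 50]

result = combine_flows(sources, destinations, times, lengths)
print(result)
```

With plain strings as IPs, each row is indexed as row[0] and row[1] instead of row[0][0] and row[1][0].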
Answer 2 (score: 0)
Here's my idea:
data = [
    (['172.217.2.161'], ['10.247.15.39'], '13:25:31.044180', 46),
    (['172.217.2.161'], ['10.247.15.39'], '13:25:31.044190', 29),
    (['172.217.2.161'], ['10.247.15.39'], '13:25:31.044200', 50)
]
source = [d[0] for d in data]
dest = [d[1] for d in data]
time = [d[2] for d in data]
length = [d[3] for d in data]

from collections import defaultdict

def combine_data(source, dest, time, length):
    CombinePacket = [(source[i], dest[i], time[i], length[i]) for i in range(len(source))]
    # Group (time, size) pairs by (source IP, destination IP)
    data = defaultdict(list)
    for package in CombinePacket:
        data[(package[0][0], package[1][0])].append((package[2], package[3]))
    result = []
    for key, value in data.items():
        # Sort by timestamp so the first and last entries bound the flow
        value = sorted(value, key=lambda x: x[0])
        first_time = value[0][0]
        last_time = value[-1][0]
        sum_length = sum(v[1] for v in value)
        result.append([key[0], key[1], first_time, last_time, sum_length])
    return result
Store the data in a dictionary keyed by (source, dest), then sort by time to get the first and last timestamps; the total size is the sum of all sizes stored under that key.
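One caveat: sorting the timestamps as text only works while they are fixed-width. If the hour were ever written without zero padding, "10:00:00.000000" would sort before "9:59:59.000000". A possible sketch that parses the times with datetime.strptime before comparing them follows; the format string is an assumption based on the sample data, and the function name and sample packets are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%H:%M:%S.%f"  # assumed format, inferred from the sample data

def combine_with_parsed_times(packets):
    """Group (source, dest, timestamp, size) tuples by (source, dest)."""
    grouped = defaultdict(list)
    for source, dest, timestamp, size in packets:
        parsed = datetime.strptime(timestamp, TIME_FORMAT)
        grouped[(source, dest)].append((parsed, timestamp, size))
    result = []
    for (source, dest), entries in grouped.items():
        # Compare on the parsed time, but report the original strings.
        first = min(entries, key=lambda entry: entry[0])
        last = max(entries, key=lambda entry: entry[0])
        total_size = sum(size for _, _, size in entries)
        result.append([source, dest, first[1], last[1], total_size])
    return result

# Hypothetical packets whose hour is not zero-padded.
packets = [
    ("a", "b", "10:00:00.000000", 5),
    ("a", "b", "9:59:59.000000", 10),
]
combined = combine_with_parsed_times(packets)
print(combined)
```

Like the answers above, this still assumes all packets fall on the same day, since the strings carry no date.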