Hi, this is hard to explain properly in a title, so let me first describe my data. I have 40 lists stored in a list, each of the following form:
data[0] = [[value1 value2 value3,80],[value1,90],[value1 value3,60],[value2 value3,70]]
data[1] = [[value2,40],[value1 value2 value3,90]]
data[2] = [[value1 value2,80],[value1,50],[value1 value3,20]]
.
.
.
Now I expect output like this:
data[0] = [[value1 value2 value3,80],[value1,90],[value1 value3,60],[value2 value3,70],[value2,0],[value1 value2,0]]
data[1] = [[value2,40],[value1 value2 value3,90],[value1,0],[value1 value3,0],[value2 value3,0],[value1 value2,0]]
data[2] = [[value1 value2,80],[value1,50],[value1 value3,20],[value1 value2 value3,0],[value2 value3,0],[value2,0]]
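I know this is a bit complicated, but I wanted to make sure the data was well illustrated. Basically, every list needs to contain every combination that exists in any of the lists, and if a combination is not present in a given list, it is added with a frequency (the second field) of 0.
Thanks for your help. Keep in mind that this is an intersection of 40 different lists, so it needs to be fast and efficient. I'm not sure how best to do this...
Edit: I also don't know all the 'values' in advance. For simplicity I wrote 3 different values here (value1, value2, value3). In my project, I don't know what the values are or how many distinct ones there are (I know there are at least a few thousand).
Edit 2: Here is some real input data. I don't have real output data, but I'll try to work it out: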
data[0] = [['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', '39.7769']]
data[1] = [['syslog_priority:Info', '100'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', '43.8362']]
data[2] = [['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', '43.9506']]
Answer 0 (score: 4)
It sounds like you could use sets:
>>> {1, 2, 3, 4, 5} & {2, 3, 4, 5, 6, 7} & {3, 4, 5}
{3, 4, 5}
& is the intersection operator for sets. To get a set from a list (which removes duplicate elements), use set(mylist).
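Edit: Based on your comment, it seems that what you need is some kind of union (the union operator is |), not an intersection.
Here is a function that does what you described in the comments, for two lists of lists: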
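def function(first, second):
    # Convert each [key, frequency] pair to a hashable tuple.
    first_set = {tuple(i) for i in first}
    second_set = {tuple(i) for i in second}
    # Add each list's keys to the other list with a frequency of 0.
    return (first_set | {(i[0], 0) for i in second_set},
            second_set | {(i[0], 0) for i in first_set})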
>>> a = [(1,60),(3,90)]
>>> b = [(2,30),(4,50)]
>>> x, y = function(a, b)
>>> print(x)
{(2, 0), (3, 90), (1, 60), (4, 0)}
>>> print(y)
{(3, 0), (4, 50), (1, 0), (2, 30)}
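The same idea can be extended from two lists to any number of them. The following is only a rough sketch, not part of the original answer: the name pad_all and the choice to return lists of tuples are my own assumptions. Note that the two-list function above also adds (key, 0) when the key is already present in the other list with a non-zero frequency; this sketch pads only the keys that are actually missing.

```python
def pad_all(lists):
    """Give every list an entry for each key seen in any list.

    Keys missing from a list are appended with a frequency of 0.
    (pad_all is a hypothetical helper name, not from the answer above.)
    """
    # Pass 1: collect every key (the first field) that appears anywhere.
    all_keys = set()
    for pairs in lists:
        all_keys.update(key for key, _ in pairs)

    padded = []
    for pairs in lists:
        present = {key for key, _ in pairs}
        # sorted() only makes the order of the appended entries deterministic.
        missing = sorted(all_keys - present)
        padded.append([tuple(p) for p in pairs] + [(key, 0) for key in missing])
    return padded


a = [(1, 60), (3, 90)]
b = [(2, 30), (4, 50)]
c = [(1, 10)]
x, y, z = pad_all([a, b, c])
print(x)  # [(1, 60), (3, 90), (2, 0), (4, 0)]
print(y)  # [(2, 30), (4, 50), (1, 0), (3, 0)]
```

Because the keys are collected once and each list is then checked against that set, the work grows with the total number of entries rather than with the number of list pairs.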
Answer 1 (score: 1)
Given your comments, I would use sets, as already suggested.
First, iterate over the lists to build a set of every possible string:
possible_strings = set()
for row in mydata:
    for item in row:
        # item[0] is the string, item[1] is its frequency.
        possible_strings.add(item[0])
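So possible_strings now contains every possible string in your data.
Now you need to check the strings in each row, and if one is not present, append it to the row with a frequency of 0: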
my_new_data = []
for row in mydata:
    row_strings = set(item[0] for item in row)
    missing_strings = possible_strings - row_strings
    for item in missing_strings:
        # Append each missing string with a frequency of 0.
        row.append([item, 0])
    row.sort()
    my_new_data.append(row)
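The reason I would use sets is that you don't have to do any lookups, and the items are strings, so they can be members of a set. There are ways to speed this up (and condense the code), but I like to spell things out so I can see clearly what I'm doing. Unless I've mistyped something (I have already corrected 3 typos), this code ran on my computer.
Here are the results, without sorting: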
newrow*************
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80', '39.7769']
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', 0]
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', 0]
['syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', 0]
newrow*************
['syslog_priority:Info', '100']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', '43.8362']
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', 0]
newrow*************
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', 0]
['syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', 0]
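For comparison, the two loops from this answer can be condensed with set comprehensions. This is just a sketch under my own assumptions: fill_missing is a hypothetical name, and unlike the code above it builds new rows instead of appending to the original ones.

```python
def fill_missing(mydata):
    """Return new rows in which every string seen anywhere appears in every row.

    fill_missing is a hypothetical name; the input rows are left unmodified.
    """
    # Every distinct string (first field) across all rows.
    possible_strings = {item[0] for row in mydata for item in row}

    new_data = []
    for row in mydata:
        row_strings = {item[0] for item in row}
        # sorted() only makes the order of the appended items deterministic.
        missing = sorted(possible_strings - row_strings)
        new_data.append(list(row) + [[s, 0] for s in missing])
    return new_data


mydata = [
    [["a b", "80"], ["a", "90"]],
    [["b", "40"]],
]
result = fill_missing(mydata)
print(result[0])  # [['a b', '80'], ['a', '90'], ['b', 0]]
print(result[1])  # [['b', '40'], ['a', 0], ['a b', 0]]
```

As the results above show, strings that contain the same fields in a different order are treated as different keys. If field order should not matter (the question does not say), the keys could be normalized first, for example with " ".join(sorted(s.split())).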
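Answer 2 (score: 1)
It sounds like you want dictionaries, and then you want to compare the keys, which are what you call the lists of "values", rather than the dictionary values, which are the frequencies. Of course, there is no need to restructure the data into dictionaries, but it might make more sense.
Now, for an actual answer: create a new list/dictionary just to hold the complete list of all keys ("value lists"). Then go through the data again and add the missing elements to the lists that lack them. Each outer loop runs 40 times. The first outer loop is O(n ** 2), where n is the total number of unique keys, though I imagine the average case will be less than n ** 2. The second outer loop is also O(n ** 2).
I hope this isn't too brute-force. At least it is better than comparing data[n] with data[n + m] for n in 0-40... For the outer loops that would be 40 ** 2 iterations, which is still a constant, but obviously a larger one than 80.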
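The two passes described above might look roughly like this with dictionaries. It is a sketch only: to_dicts and pad_dicts are hypothetical names, and it assumes each key appears at most once per list. With a set and dicts, membership checks are constant time on average, so each pass should be roughly linear in the number of entries rather than quadratic.

```python
def to_dicts(data):
    """Turn each list of [key, frequency] pairs into a dict (hypothetical helper)."""
    return [{key: freq for key, freq in row} for row in data]


def pad_dicts(dicts):
    """Add every key seen in any dict to all dicts, defaulting to 0 (hypothetical helper)."""
    # Pass 1: collect the complete set of keys.
    all_keys = set()
    for d in dicts:
        all_keys.update(d)
    # Pass 2: add the keys each dict is missing; existing values are kept.
    for d in dicts:
        for key in all_keys:
            d.setdefault(key, 0)
    return dicts


data = [
    [["value1 value2", 80], ["value1", 50]],
    [["value2", 40]],
]
dicts = pad_dicts(to_dicts(data))
print(dicts[0]["value2"])  # 0
print(dicts[1]["value2"])  # 40
```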
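Answer 3 (score: 1)
Correct me if I'm wrong, but I think the best solution is a dictionary for each desired output plus a master set of keys. A set basically stores each value without allowing duplicates. Using the example above, I would do this: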
master_set = set()
for current_list in list_of_lists:
    # |= needs a set on the right-hand side, so use a set comprehension.
    master_set |= {entry[0] for entry in current_list}
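where |= is the in-place union operator for sets.
Once you have that set, you build a dictionary for each list that contains the relevant value or zero for every entry. First I would build a dictionary, then just add the results for the missing items.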
full_dictionary = {}
for entry in master_set:
    # Collects the matching frequencies; an entry missing from current_list yields an empty list.
    full_dictionary[entry] = [thing[1] for thing in current_list if thing[0] == entry]
Then just generate the full dictionary for each list.
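Put together, generating a full dictionary per list could look something like the sketch below. The sample data and the name full_dictionaries are my own assumptions, and it uses dict.get with a default of 0 so that missing entries come out as zero rather than as an empty list.

```python
# Hypothetical sample data in the same [key, frequency] shape as the question.
list_of_lists = [
    [["value1 value2", 80], ["value1", 50]],
    [["value2", 40]],
]

# Master set of every key that appears in any list.
master_set = set()
for current_list in list_of_lists:
    master_set |= {entry[0] for entry in current_list}

full_dictionaries = []
for current_list in list_of_lists:
    # Each entry is a two-item [key, frequency] pair, so dict() accepts it directly.
    lookup = dict(current_list)
    # sorted() only fixes the key order; missing keys default to 0.
    full_dictionaries.append({key: lookup.get(key, 0) for key in sorted(master_set)})

print(full_dictionaries[1])  # {'value1': 0, 'value1 value2': 0, 'value2': 40}
```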
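Alternatively, if you can choose how the data comes in, or just want to restructure it sensibly, I would suggest a dictionary comprehension, which simplifies this: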
new_dict = {value[0]: value[1] for value in current_list}
I had some trouble interpreting the question, so let me know if this is inaccurate and I can revise it.