Find the differences between lists and append the differences to the lists, but for 40 different lists - python

Time: 2013-08-09 12:26:44

Tags: python list set-intersection

Hi, this is hard to explain properly in a title, so let me first explain my data. I have 40 lists stored inside a list, each of the following form:

data[0] = [[value1 value2 value3,80],[value1,90],[value1 value3,60],[value2 value3,70]]
data[1] = [[value2,40],[value1 value2 value3,90]]
data[2] = [[value1 value2,80],[value1,50],[value1 value3,20]]
   .
   .
   .

The output I expect is:

data[0] = [[value1 value2 value3,80],[value1,90],[value1 value3,60],[value2 value3,70],[value2,0],[value1 value2,0]]
data[1] = [[value2,40],[value1 value2 value3,90],[value1,0],[value1 value3,0],[value2 value3,0],[value1 value2,0]]
data[2] = [[value1 value2,80],[value1,50],[value1 value3,20],[value1 value2 value3,0],[value2 value3,0],[value2,0]]    

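I know this is a bit complicated, but I wanted to make sure the data is well illustrated. Basically, every list needs to contain every combination of values that appears in any of the lists; if a combination is missing from a given list, it is added with a frequency (the second field) of 0.

Thanks for any help. Keep in mind this involves 40 different lists, so it needs to be fast and efficient. I'm not sure how best to do this...

EDIT: I also don't know all the 'values' in advance. For simplicity I wrote three different values here (value1, value2, value3). In my project I don't know what the values are or how many distinct ones there are (I know there are at least a few thousand).

EDIT 2: Here is some real input data. I don't have real output data, but I'll try to work it out: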

data[0] = [['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http destination_port:80', '39.7769'], ['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', '39.7769']]


data[1] = [['syslog_priority:Info', '100'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', '43.8362'], ['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', '43.8362']]


data[2] = [['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 destination_service:http', '43.9506'], ['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', '43.9506']]

4 answers:

Answer 0 (score: 4)

It sounds like you could use sets:

>>> {1, 2, 3, 4, 5} & {2, 3, 4, 5, 6, 7} & {3, 4, 5}
{3, 4, 5}

& is the intersection operator for sets. To get a set from a list (which also removes duplicate elements), use set(mylist).
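For more than a handful of lists, chaining & by hand gets unwieldy. A minimal sketch, assuming the lists hold hashable items; the names `lists`, `sets`, `common` and `everything` are made up for illustration:

```python
# Three small lists standing in for the 40 real ones; note the duplicates.
lists = [
    [1, 2, 3, 4, 5, 3],
    [2, 3, 4, 5, 6, 7],
    [3, 4, 5, 5],
]

# set(mylist) drops duplicate elements.
sets = [set(items) for items in lists]

# Items present in every list (intersection) and in any list (union).
common = set.intersection(*sets)
everything = set.union(*sets)

print(common)
print(everything)
```

Both calls need at least one set in `sets`; with an empty list of lists they raise a TypeError.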

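EDIT: Based on your comment, it seems what you need is some kind of union (the union operator is |) rather than an intersection. Here is a function that does what you described in the comments for two lists of lists: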

def function(first, second):
    first_set = {tuple(i) for i in first}
    second_set = {tuple(i) for i in second}
    return (first_set | {(i[0], 0) for i in second_set},
            second_set | {(i[0], 0) for i in first_set})

>>> a = [(1,60),(3,90)]
>>> b = [(2,30),(4,50)]
>>> x, y = function(a, b)
>>> print(x)
{(2, 0), (3, 90), (1, 60), (4, 0)}
>>> print(y)
{(3, 0), (4, 50), (1, 0), (2, 30)}
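The two-list function above might be extended to any number of lists along these lines. This is only a sketch: `pad_all` is a hypothetical name, each entry is assumed to be a (key, frequency) pair, and unlike the function above it adds a zero entry only for keys a list does not already have.

```python
def pad_all(lists):
    """Return one set of (key, frequency) tuples per input list,
    padded with (key, 0) for every key seen only in other lists."""
    sets = [{tuple(entry) for entry in entries} for entries in lists]
    # Every key that appears in any of the lists.
    all_keys = {key for s in sets for key, _ in s}
    padded = []
    for s in sets:
        present = {key for key, _ in s}
        padded.append(s | {(key, 0) for key in all_keys - present})
    return padded


a = [(1, 60), (3, 90)]
b = [(2, 30), (4, 50)]
c = [(1, 10)]
x, y, z = pad_all([a, b, c])
print(sorted(x))
print(sorted(y))
print(sorted(z))
```

The keys are collected once and reused for every list, so the 40 lists are never compared pairwise.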

Answer 1 (score: 1)

Given your comments, I would use sets, as already suggested.

First, iterate over the lists to build a set of every possible string:

possible_strings = set()
for row in mydata:
    for item in row:
        # item[0] is the string; item[1] is its frequency.
        possible_strings.add(item[0])

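So possible_strings now contains every possible string in your data.

Now you need to check the strings in each row; any string that is missing must be appended to that row with a frequency of 0: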

my_new_data = []
for row in mydata:
    row_strings = set(item[0] for item in row)
    missing_strings = possible_strings - row_strings
    for item in missing_strings:
        # Append each missing string with a frequency of 0.
        row.append([item, 0])
    row.sort()
    my_new_data.append(row)

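The reason I would use sets is that you don't have to do any lookups yourself, and the items are strings, so they can be members of a set. There are ways to speed this up (and condense the code), but I like to spell things out so I can see clearly what I'm doing. Unless I've made a typo (I've already corrected 3), this code ran on my machine.

Here are the unsorted results: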

newrow*************
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 destination_service:http destination_port:80', '39.7769']
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', '39.7769']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', 0]
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', 0]
['syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', 0]
newrow*************
['syslog_priority:Info', '100']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', '43.8362']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', '43.8362']
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', 0]
newrow*************
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http syslog_priority:Info', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 destination_service:http', '43.9506']
['destination_ip:10.32.0.100 destination_port:80 syslog_priority:Info protocol:TCP', '43.9506']
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 protocol:TCP syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http destination_port:80 protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http protocol:TCP syslog_priority:Info', 0]
['syslog_priority:Info', 0]
['destination_ip:10.32.0.100 syslog_priority:Info destination_service:http protocol:TCP', 0]
['destination_ip:10.32.0.100 destination_service:http destination_port:80 syslog_priority:Info', 0]

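Answer 2 (score: 1)

It sounds like you want dictionaries, and then you want to compare the keys, which are the lists of 'values' as you have them, rather than the dictionary values, which are the frequencies. Of course, there's no need to restructure the data into dictionaries, but it might make more sense.

Now, for an actual answer: create a new list/dictionary just to hold the complete list of all keys ('lists of values'). Then make a second pass and add the missing elements to the lists that lack them. Each outer loop runs 40 times. The first outer loop is O(n**2), where n is the total number of unique keys, though I'd imagine the average case is less than n**2. The second outer loop is also O(n**2).

I hope this isn't too brute-force. At least it's better than comparing data[n] with data[n + m] for n from 0 to 40... that would be 40**2 for the outer loops... which is still a constant, but obviously a larger one than 80.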
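The two passes described above could look roughly like this. It is a sketch rather than a tested solution for the real data: `fill_missing` and `as_dicts` are hypothetical names, and each inner list is assumed to hold [key, frequency] pairs with unique keys.

```python
def fill_missing(dicts):
    """Give every dictionary a 0 entry for each key it lacks (in place)."""
    # Pass 1: collect every key seen in any dictionary.
    all_keys = set()
    for d in dicts:
        all_keys.update(d)
    # Pass 2: add the keys each dictionary is missing.
    for d in dicts:
        for key in all_keys.difference(d):
            d[key] = 0
    return dicts


data = [
    [["value1 value2", 80], ["value1", 50]],
    [["value2", 40]],
]
# Restructure each list of [key, frequency] pairs as a dictionary.
as_dicts = [dict(rows) for rows in data]
fill_missing(as_dicts)
print(as_dicts)
```

Because sets and dictionaries are hash-based, each membership check takes roughly constant time on average, so in practice both passes should stay close to linear in the total number of entries.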

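Answer 3 (score: 1)

Correct me if I'm wrong, but I think the best solution is a dictionary for each desired output plus a master set of keys. A set essentially stores each value without allowing duplicates. With the example above, I would do this: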

master_set = set()
for current_list in list_of_lists:
    # |= needs a set on the right-hand side, so use a set comprehension.
    master_set |= {entry[0] for entry in current_list}

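where |= is the in-place union operator for sets.

Once you have this set, you build a dictionary for each list that contains, for every entry, either the relevant value or zero. First I build the dictionary, then I just add the results for the missing items.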

full_dictionary = {}
for entry in master_set:
    full_dictionary[entry] = [thing[1] for thing in current_list if thing[0] == entry]

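Then just generate the full dictionary for each list.

Alternatively, if you have a choice in how the data comes in, or just want to restructure it sensibly, I'd suggest a dictionary comprehension, which simplifies this: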

new_dict = {value[0]: value[1] for value in current_list}

I also had some trouble interpreting the question, so let me know if this is inaccurate and I can revise it.