So I have two lists of data that look like this (shortened):
[[1.0, 1403603100],
[0.0, 1403603400],
[2.0, 1403603700],
[0.0, 1403604000],
[None, 1403604300]]
[[1.0, 1403603100],
[0.0, 1403603400],
[1.0, 1403603700],
[None, 1403604000],
[5.0, 1403604300]]
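What I want to do is merge them by summing the first element of each corresponding row, or, if either counter value is None, setting the merged value to 0.0. So the example above would become: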
[[2.0, 1403603100],
[0.0, 1403603400],
[3.0, 1403603700],
[0.0, 1403604000],
[0.0, 1403604300]]
Here is what I have come up with so far; apologies if it is a bit clumsy.
def emit_datum(datapoints):
    for datum in datapoints:
        yield datum

def merge_data(data_set1, data_set2):
    assert len(data_set1) == len(data_set2)
    data_length = len(data_set1)
    data_gen1 = emit_datum(data_set1)
    data_gen2 = emit_datum(data_set2)
    merged_data = []
    for _ in range(data_length):
        # next(gen) works on both Python 2.6+ and Python 3
        datum1 = next(data_gen1)
        datum2 = next(data_gen2)
        if datum1[0] is None or datum2[0] is None:
            merged_data.append([0.0, datum1[1]])
            continue
        count = datum1[0] + datum2[0]
        merged_data.append([count, datum1[1]])
    return merged_data
I can only hope/assume there is something clever I could do with itertools or collections?
Answer 0 (score: 1)
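How about "merging" the data by identifier, i.e. collecting all values that belong to one identifier (e.g. 1403603400) and summing them afterwards? A dictionary is a good fit for collecting all values that correspond to an identifier (the key), and a defaultdict of type list makes this especially simple: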
>>> data = [[1.0, 1403603100], [1.0, 1403603100],
... [0.0, 1403603400], [0.0, 1403603400],
... [2.0, 1403603700], [1.0, 1403603700],
... [0.0, 1403604000], [None, 1403604000],
... [None, 1403604300], [5.0, 1403604300]]
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> for value, identifier in data:
...     d[identifier].append(value)
...
Now that the data is grouped by identifier, we can sum it conditionally:
>>> for identifier, valuelist in d.iteritems():
...     if None not in valuelist:
...         print identifier, sum(valuelist)
...     else:
...         print identifier, 0.0
...
1403603400 0.0
1403603700 3.0
1403603100 2.0
1403604300 0.0
1403604000 0.0
Finally, to get the list you wanted:
>>> [[i, sum(v)] if None not in v else [i, .0] for i, v in d.iteritems()]
[[1403603400, 0.0], [1403603700, 3.0], [1403603100, 2.0], [1403604300, 0.0], [1403604000, 0.0]]
This approach requires the data sets to be mixed into a single list first, as they were in the first version of the example input.
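The session above is Python 2 (iteritems, print statement) and yields [identifier, sum] pairs in dictionary order. As a rough sketch of the same grouping idea in Python 3, the version below mixes the two separate lists with itertools.chain and returns rows in the [value, identifier] layout the question asks for, sorted by identifier. The function name merge_by_identifier and the variables set1 and set2 are made up for this illustration and are not part of the original answer.

```python
from collections import defaultdict
from itertools import chain


def merge_by_identifier(*data_sets):
    """Group values by identifier, then sum each group (0.0 if any value is None)."""
    grouped = defaultdict(list)
    # chain() mixes the separate data sets into one stream of [value, identifier] rows
    for value, identifier in chain(*data_sets):
        grouped[identifier].append(value)
    # sort by identifier so the result does not depend on dictionary ordering
    return [
        [0.0 if None in values else sum(values), identifier]
        for identifier, values in sorted(grouped.items())
    ]


set1 = [[1.0, 1403603100], [0.0, 1403603400], [2.0, 1403603700],
        [0.0, 1403604000], [None, 1403604300]]
set2 = [[1.0, 1403603100], [0.0, 1403603400], [1.0, 1403603700],
        [None, 1403604000], [5.0, 1403604300]]

merged = merge_by_identifier(set1, set2)
print(merged)
```

Because the rows are matched by identifier rather than by position, this variant does not need the two lists to have the same length or the same order.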
Answer 1 (score: 1)
If you want the merged value to be 0.0 whenever either of the two values is None, a simple loop is all you need.
l1 = [[1.0, 1403603100],
      [0.0, 1403603400],
      [2.0, 1403603700],
      [0.0, 1403604000],
      [None, 1403604300]]
l2 = [[1.0, 1403603100],
      [0.0, 1403603400],
      [1.0, 1403603700],
      [None, 1403604000],
      [5.0, 1403604300]]
final = []
assert len(l1) == len(l2)
for x, y in zip(l1, l2):
    if x[0] is None or y[0] is None:
        # note: y is a row of l2, so this writes 0.0 into l2 in place
        y[0] = 0.0
        final.append(y)
    else:
        final.append([x[0] + y[0], x[-1]])
print(final)
[[2.0, 1403603100], [0.0, 1403603400], [3.0, 1403603700], [0.0, 1403604000], [0.0, 1403604300]]
In [51]: %timeit merge_data(l1,l2)
100000 loops, best of 3: 5.76 µs per loop
In [52]: %%timeit
   ....: final = []
   ....: assert len(l1)==len(l2)
   ....: for x, y in zip(l1, l2):
   ....:     if x[0] is None or y[0] is None:
   ....:         y[0] = 0.0
   ....:         final.append(y)
   ....:     else:
   ....:         final.append([x[0] + y[0], x[-1]])
   ....:
100000 loops, best of 3: 2.64 µs per loop
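The %timeit and %%timeit commands above are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard timeit module, as below. The function names merge_with_iterators and merge_with_zip are invented for this sketch: the first only approximates the question's merge_data, and the second builds a fresh row instead of writing 0.0 into l2 as the loop above does. No timings are quoted here because they depend on the machine and Python version.

```python
import timeit

l1 = [[1.0, 1403603100], [0.0, 1403603400], [2.0, 1403603700],
      [0.0, 1403604000], [None, 1403604300]]
l2 = [[1.0, 1403603100], [0.0, 1403603400], [1.0, 1403603700],
      [None, 1403604000], [5.0, 1403604300]]


def merge_with_iterators(data_set1, data_set2):
    """Close to the question's approach: advance two iterators in lockstep."""
    it1, it2 = iter(data_set1), iter(data_set2)
    merged = []
    for _ in range(len(data_set1)):
        d1, d2 = next(it1), next(it2)
        if d1[0] is None or d2[0] is None:
            merged.append([0.0, d1[1]])
        else:
            merged.append([d1[0] + d2[0], d1[1]])
    return merged


def merge_with_zip(data_set1, data_set2):
    """The zip loop, appending a new row so the inputs are left untouched."""
    final = []
    for x, y in zip(data_set1, data_set2):
        if x[0] is None or y[0] is None:
            final.append([0.0, x[-1]])
        else:
            final.append([x[0] + y[0], x[-1]])
    return final


# both variants must agree before their timings are worth comparing
assert merge_with_iterators(l1, l2) == merge_with_zip(l1, l2)

number = 10000
for func in (merge_with_iterators, merge_with_zip):
    seconds = timeit.timeit(lambda: func(l1, l2), number=number)
    print("%s: %.2f microseconds per call" % (func.__name__, seconds / number * 1e6))
```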
Answer 2 (score: 0)
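With numpy arrays you do not need any loops. If you are working with larger data sets, this will make your code faster. In the session below, a and b are presumably the two input lists.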
import numpy as np
In [68]: a = np.asarray(a)
In [69]: b = np.asarray(b)
In [71]: a_none_idx = np.equal(a,None)
In [72]: b_none_idx = np.equal(b,None)
In [73]: a[a_none_idx]=0
In [74]: b[b_none_idx]=0
In [76]: c = np.zeros(a.shape)
In [77]: c[:,0]= a[:,0] + b[:,0]
In [78]: c
Out[78]:
array([[ 2.,  0.],
       [ 0.,  0.],
       [ 3.,  0.],
       [ 0.,  0.],
       [ 5.,  0.]])
In [79]: c[a_none_idx]=0
In [80]: c[b_none_idx]=0
In [81]: c[:,1] = a[:,1]
In [82]: c
Out[82]:
array([[  2.00000000e+00,   1.40360310e+09],
       [  0.00000000e+00,   1.40360340e+09],
       [  3.00000000e+00,   1.40360370e+09],
       [  0.00000000e+00,   1.40360400e+09],
       [  0.00000000e+00,   1.40360430e+09]])
Answer 3 (score: 0)
You can use zip, like this:
def merge(list1, list2):
    returnlist = []
    for x, y in zip(list1, list2):
        if x[0] is None or y[0] is None:
            returnlist.append([0.0, x[1]])
        else:
            returnlist.append([x[0] + y[0], x[1]])
    return returnlist
zip returns an iterator of tuples, each holding the elements at the same index from every input list, i.e. (list1[0], list2[0]), (list1[1], list2[1]), and so on.
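To make the pairing concrete, here is a small Python 3 sketch with made-up sample rows. It shows the tuples zip produces and that zip stops at the shorter input, which is why a length check such as the assert in the question can be useful. It also writes the same merge as a list comprehension; that is only one possible compact form, not something the answer itself proposes.

```python
list1 = [[1.0, 1403603100], [None, 1403603400]]
list2 = [[2.0, 1403603100], [5.0, 1403603400], [7.0, 1403603700]]

# zip yields one tuple per index; wrapping it in list() makes the pairs visible
pairs = list(zip(list1, list2))
print(pairs)

# zip stops at the shorter input, so the third row of list2 is silently dropped
print(len(pairs))

# the same merge rule as a list comprehension: 0.0 if either value is None
merged = [
    [0.0 if x[0] is None or y[0] is None else x[0] + y[0], x[1]]
    for x, y in zip(list1, list2)
]
print(merged)
```

In Python 2, zip returns a list directly rather than an iterator, so the list() call is harmless there but not needed.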