Consider the following 2D array whose rows have different lengths:
[
[1, 2, 3],
[4, 5],
[6, 7, 8, 9]
]
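How can I compute the mean of each column?
I want [(1+4+6)/3, (2+5+7)/3, (3+8)/2, 9/1],
so the final result is [3.667, 4.667, 5.5, 9].
Is this possible with numpy?
I tried np.mean(x, axis=0), but numpy expects arrays of the same dimension.
Right now I am popping the elements of each column and computing the mean. Is there a better way to get this result?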
Answer 0 (score: 5)
You can use pandas:
import pandas as pd
a = [[1, 2, 3],
[4, 5],
[6, 7, 8, 9]]
df = pd.DataFrame(a)
# 0 1 2 3
# 0 1 2 3 NaN
# 1 4 5 NaN NaN
# 2 6 7 8 9
df.mean()
# 0 3.666667
# 1 4.666667
# 2 5.500000
# 3 9.000000
# dtype: float64
Here is another solution that uses only numpy:
import numpy as np
nrows = len(a)
ncols = max(len(row) for row in a)
arr = np.zeros((nrows, ncols))
arr.fill(np.nan)
for jrow, row in enumerate(a):
    for jcol, col in enumerate(row):
        arr[jrow, jcol] = col  # cells that are never written stay nan
print(np.nanmean(arr, axis=0))
# array([ 3.66666667, 4.66666667, 5.5 , 9. ])
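The nan-padding step can also be done without a Python double loop by using a boolean mask. This is only a minimal sketch, assuming every row is a flat list of numbers and the input is not empty; the name `pad_with_nan` is hypothetical and not part of NumPy.

```python
import numpy as np

def pad_with_nan(rows):
    """Return a 2D float array with short rows padded by nan (hypothetical helper)."""
    lens = np.array([len(row) for row in rows])
    # mask[i, j] is True where row i actually has a j-th element
    mask = np.arange(lens.max()) < lens[:, None]
    arr = np.full(mask.shape, np.nan)
    # Boolean assignment fills the True cells in row-major order
    arr[mask] = np.concatenate(rows)
    return arr

a = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
means = np.nanmean(pad_with_nan(a), axis=0)
print(means)
```

The padded array costs nrows * ncols floats, so this trades memory for speed when a few rows are much longer than the rest.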
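Answer 1 (score: 2)
Listed in this post is an almost vectorized approach using NumPy. We assign each element an ID based on its position within its sub-list. These IDs can then be fed to np.bincount, which performs ID-based summation. Finally, we divide each sum by the number of elements that share that ID to get the means.
So, we would have an implementation like this: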
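def variable_mean(a):
    vals = np.concatenate(a)
    # A list comprehension works on both Python 2 and 3 (map() is lazy on 3)
    lens = np.array([len(row) for row in a])
    id_arr = np.ones(vals.size, dtype=int)
    id_arr[0] = 0
    # Reset the running position to 0 at the start of each sub-list
    id_arr[lens.cumsum()[:-1]] = -lens[:-1] + 1
    IDs = id_arr.cumsum()
    return np.bincount(IDs, vals) / np.bincount(IDs)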
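To make the ID trick easier to follow, here is a step-by-step trace on the question's sample input. It is a sketch that simply unrolls the function body and, as far as I can tell, assumes that no sub-list is empty (an empty sub-list would produce duplicate reset positions).

```python
import numpy as np

a = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

vals = np.concatenate(a)                    # all values, flattened
lens = np.array([len(row) for row in a])    # [3, 2, 4]

# Start with steps of +1, then jump back to 0 at the start of every sub-list
id_arr = np.ones(vals.size, dtype=int)
id_arr[0] = 0
id_arr[lens.cumsum()[:-1]] = -lens[:-1] + 1
ids = id_arr.cumsum()                       # column index of every value

col_sums = np.bincount(ids, weights=vals)   # sum per column
col_counts = np.bincount(ids)               # number of values per column
print(ids.tolist())         # [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(col_sums.tolist())    # [11.0, 14.0, 11.0, 9.0]
print(col_counts.tolist())  # [3, 3, 2, 1]
```

Dividing `col_sums` by `col_counts` then gives the column means.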
Runtime test:
In [298]: # Setup input
...: N = 1000 # number of elems in input list
...: minL = 3 # min len of an element (list) in input list
...: maxL = 10 # max len of an element (list) in input list
...: a = [list(np.random.randint(0,9,(i))) \
...: for i in np.random.randint(minL,maxL,(N))]
...:
In [299]: %timeit pd.DataFrame(a).mean() #@Julien Spronck's pandas soln
100 loops, best of 3: 3.33 ms per loop
In [300]: %timeit variable_mean(a)
100 loops, best of 3: 2.36 ms per loop
In [301]: # Setup input
...: N = 1000 # number of elems in input list
...: minL = 3 # min len of an element (list) in input list
...: maxL = 100 # max len of an element (list) in input list
...: a = [list(np.random.randint(0,9,(i))) \
...: for i in np.random.randint(minL,maxL,(N))]
...:
In [302]: %timeit pd.DataFrame(a).mean() #@Julien Spronck's pandas soln
10 loops, best of 3: 27.1 ms per loop
In [303]: %timeit variable_mean(a)
100 loops, best of 3: 9.58 ms per loop
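The `%timeit` magic only exists inside IPython. For a plain Python script, a rough equivalent can be sketched with the standard `timeit` module. The fixed seed and the repeat counts below are my own assumptions, and the numbers it prints will differ from the ones quoted above.

```python
import timeit
import numpy as np

def variable_mean(a):
    vals = np.concatenate(a)
    lens = np.array([len(row) for row in a])
    id_arr = np.ones(vals.size, dtype=int)
    id_arr[0] = 0
    id_arr[lens.cumsum()[:-1]] = -lens[:-1] + 1
    ids = id_arr.cumsum()
    return np.bincount(ids, vals) / np.bincount(ids)

rng = np.random.RandomState(0)  # fixed seed (assumption) so runs are repeatable
N, minL, maxL = 1000, 3, 10
a = [list(rng.randint(0, 9, n)) for n in rng.randint(minL, maxL, N)]

# Best of 3 repeats, 10 calls each; convert to seconds per call
best = min(timeit.repeat(lambda: variable_mean(a), repeat=3, number=10)) / 10
print("variable_mean: %.3f ms per call" % (best * 1e3))
```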
Answer 2 (score: 2)
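Use itertools.zip_longest() (named izip_longest() in Python 2) as a very simple alternative:
>>> from itertools import zip_longest
>>> mean_list = []
>>> for sub_list in zip_longest(*my_list):
...     # Test against None explicitly so that a genuine 0 is not dropped
...     filtered_list = [x for x in sub_list if x is not None]
...     mean_list.append(sum(filtered_list)/(len(filtered_list)*1.0))
...
>>> mean_list
[3.6666666666666665, 4.666666666666667, 5.5, 9.0]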
where my_list is:
[
[1, 2, 3],
[4, 5],
[6, 7, 8, 9]
]
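One detail worth checking in this approach is that a legitimate 0 must not be treated as a missing value; testing `is not None` rather than truthiness handles that. The sketch below wraps the loop in a function for Python 3; the name `column_means` is hypothetical.

```python
from itertools import zip_longest  # izip_longest on Python 2

def column_means(rows):
    means = []
    for column in zip_longest(*rows):  # missing cells become None
        present = [x for x in column if x is not None]
        means.append(sum(present) / len(present))
    return means

print(column_means([[1, 2, 3], [4, 5], [6, 7, 8, 9]]))
print(column_means([[0, 2], [4]]))  # [2.0, 2.0]: the 0 still counts
```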
Answer 3 (score: 0)
If you want to do it manually, here is what I would do:
Compute the maximum array length:
max_length = 0
for array in arrays:
    if len(array) > max_length:
        max_length = len(array)
Pad the shorter arrays with None:
for array in arrays:
    while len(array) < max_length:
        array.append(None)
zip will group the values by column:
columns = list(zip(*arrays))
# columns == [(1, 4, 6), (2, 5, 7), (3, None, 8), (None, None, 9)]
Compute the mean of each column:
for col in columns:
    count = 0
    total = 0.0  # named total to avoid shadowing the built-in sum()
    for num in col:
        if num is not None:
            count += 1
            total += float(num)
    print("%s: Avg %s" % (col, total / count))
Or, after padding the arrays, as a list comprehension:
[sum(vals) / float(len(vals)) for vals in ([x for x in col if x is not None] for col in zip(*arrays))]
Output:
(1, 4, 6): Avg 3.66666666667
(2, 5, 7): Avg 4.66666666667
(3, None, 8): Avg 5.5
(None, None, 9): Avg 9.0
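The manual approach pads the input lists in place, which modifies the caller's data. A possible variation, sketched here under the assumption that all values are plain numbers, skips padding entirely and keeps a running sum and count per column. The name `running_column_means` is hypothetical.

```python
def running_column_means(arrays):
    sums = []    # sums[j]: total of column j seen so far
    counts = []  # counts[j]: how many rows had a value in column j
    for array in arrays:
        for j, num in enumerate(array):
            if j == len(sums):  # first time this column appears
                sums.append(0.0)
                counts.append(0)
            sums[j] += num
            counts[j] += 1
    return [s / c for s, c in zip(sums, counts)]

arrays = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
result = running_column_means(arrays)
print(result)
```

Because nothing is appended to the sub-lists, `arrays` is unchanged after the call.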
Answer 4 (score: 0)
In Py3, zip_longest takes a fillvalue parameter:
In [1208]: ll=[
...: [1, 2, 3],
...: [4, 5],
...: [6, 7, 8, 9]
...: ]
In [1209]: list(itertools.zip_longest(*ll, fillvalue=np.nan))
Out[1209]: [(1, 4, 6), (2, 5, 7), (3, nan, 8), (nan, nan, 9)]
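By filling with nan, I can use np.nanmean to take the mean while ignoring the nan values. nanmean converts its input (here _, the result of the previous line) into an array: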
In [1210]: np.nanmean(_, axis=1)
Out[1210]: array([ 3.66666667, 4.66666667, 5.5 , 9. ])
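The `_` variable is only available in an interactive session. In a script, the same two steps could be wrapped in a small function, sketched below; `nan_column_means` is a hypothetical name, and the explicit float conversion is my own addition.

```python
from itertools import zip_longest
import numpy as np

def nan_column_means(rows):
    # Each tuple is one column; short rows contribute nan
    columns = list(zip_longest(*rows, fillvalue=np.nan))
    # The columns are the rows of this array, so average along axis=1
    return np.nanmean(np.array(columns, dtype=float), axis=1)

ll = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
means = nan_column_means(ll)
print(means)
```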