Question

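I have a function that returns a subset of a column, and I would like to apply it efficiently to every column. The result is therefore no longer a matrix but a list of columns of different lengths. Because of this size mismatch, I could not do this with numpy apply_along_axis. Is there a way to do this efficiently, other than iterating over the columns myself?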

col_pred = lambda x: [v for v in x if v > 0.5]
filteredData = np.apply_along_axis(col_pred, 0, data)
# ValueError: could not broadcast input array from shape (3) into shape (4)

For example, given the input

data = [[0, 1, 1, 0], [1, 1, 1, 1]]
# my real data is more like a matrix with a lot of rows in [0-1]
# that can be simulated with
# data = [[random.uniform(0, 1) for i in range(10)] for j in range(100000)]

I would like to get

[[1, 1], [1, 1, 1, 1]]

Answer 1

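Looking at the code, it seems you are trying to output all elements greater than the threshold 0.5 for each column. Here is one approach that achieves this and also generalizes to work along either rows or columns: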

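import numpy as np

def threshold_along_an_axis(a, thresh = 0.5, axis=0):
    # Arrange the data so that each group to be filtered is a row of A
    if axis==0:
        A = a.T
    else:
        A = a
    mask = A>thresh
    s = mask.sum(1)            # number of kept elements per row of A
    s0 = np.r_[0,s.cumsum()]   # start/stop offsets into the flat result
    arr = A[mask].tolist() # Skip .tolist() if list of arrays is needed as o/p
    return [arr[s0[i]:s0[i+1]] for i in range(len(s0)-1)]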

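The intention here is to do minimal work inside the list comprehension: the masking and counting are done once on the whole array, and the loop only slices.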
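The same mask-then-split idea can also be sketched with np.split, which cuts the flat array of kept values at the cumulative per-row counts. This is an alternative formulation rather than the code above, and the name split_by_threshold is made up for illustration.

```python
import numpy as np

def split_by_threshold(a, thresh=0.5, axis=0):
    # Hypothetical helper: same mask-then-split idea, using np.split.
    a = np.asarray(a)
    A = a.T if axis == 0 else a      # make each group a row of A
    mask = A > thresh
    counts = mask.sum(axis=1)        # kept elements per row of A
    # A[mask] is flat, in row-major order; cut it at the cumulative counts.
    pieces = np.split(A[mask], np.cumsum(counts)[:-1])
    return [p.tolist() for p in pieces]

data = np.array([[0, 1, 1, 0], [1, 1, 1, 1]])
per_row = split_by_threshold(data, axis=1)
per_col = split_by_threshold(data, axis=0)
```

Dropping the final .tolist() conversion would give a list of arrays instead of a list of lists.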

Sample run:

In [1]: a = np.random.rand(4,5)

In [2]: a
Out[2]: 
array([[ 0.45973245,  0.3671334 ,  0.12000436,  0.04205402,  0.74729737],
       [ 0.55217308,  0.4018889 ,  0.55695863,  0.55824384,  0.33435153],
       [ 0.32450124,  0.07713855,  0.09126221,  0.13150986,  0.27961361],
       [ 0.0876053 ,  0.42685005,  0.53034652,  0.15084453,  0.51518185]])

In [3]: threshold_along_an_axis(a, thresh=0.5, axis=0) # per column
Out[3]: 
[[0.5521730819881912],
 [],
 [0.5569586261866918, 0.5303465159370833],
 [0.5582438446718111],
 [0.7472973699509776, 0.5151818458812673]]

In [4]: threshold_along_an_axis(a, thresh=0.5, axis=1) # per row
Out[4]: 
[[0.7472973699509776],
 [0.5521730819881912, 0.5569586261866918, 0.5582438446718111],
 [],
 [0.5303465159370833, 0.5151818458812673]]

Answer 2

If you want a ragged array in numpy, you have to use an object array.

First you need a small helper function that converts any value into a 0d object array:

def object_scalar(x):
    obj = np.empty((), dtype=object)
    obj[()] = x
    return obj

Then, in the upcoming 1.13, you can do:

>>> f = lambda x: object_scalar(col_pred(x))
>>> np.apply_along_axis(f, 0, data)
array([list([1]), list([1, 1]), list([1, 1]), list([1])], dtype=object)

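Unfortunately, the latest released version of numpy has a bug that prevents apply_along_axis from handling 0d arrays correctly. You can work around this by promoting to a 1d array and then demoting back to 0d: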

>>> np.apply_along_axis(lambda x: f(x)[np.newaxis], 0, data).squeeze(axis=0)
array([[1], [1, 1], [1, 1], [1]], dtype=object)
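If neither variant behaves on the installed numpy version, the object array can also be filled explicitly. A minimal sketch, assuming a per-column Python loop is acceptable; ragged_columns is a hypothetical name, not a numpy function.

```python
import numpy as np

col_pred = lambda x: [v for v in x if v > 0.5]

def ragged_columns(data, pred):
    # Hypothetical helper: apply pred to each column and store the
    # variable-length results in a 1-D object array.
    data = np.asarray(data)
    out = np.empty(data.shape[1], dtype=object)
    for j in range(data.shape[1]):
        out[j] = pred(data[:, j])    # each element holds one Python list
    return out

result = ragged_columns([[0, 1, 1, 0], [1, 1, 1, 1]], col_pred)
```

This gives up any vectorization, but it avoids relying on how apply_along_axis treats 0d results.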

Answer 3

Treated as a Python list problem, this is:

In [606]: col_pred = lambda x: [v for v in x if v > 0.5]
In [607]: data = [[0, 1, 1, 0], [1, 1, 1, 1]]
In [608]: [col_pred(i) for i in data]
Out[608]: [[1, 1], [1, 1, 1, 1]]

With your large data example, generating the data takes much longer than running this list comprehension:

In [611]: data1 = [[np.random.uniform(0, 1) for i in range(10)] for j in range(100000)]
In [612]: timeit data1 = [[np.random.uniform(0, 1) for i in range(10)] for j in range(100000)]
1 loop, best of 3: 2.62 s per loop

In [615]: data2=[col_pred(i) for i in data1]
In [618]: timeit data2=[col_pred(i) for i in data1]
10 loops, best of 3: 191 ms per loop

Compare this with @Divakar's efficient numpy solution:

In [622]: threshold_along_an_axis(np.array(data).T)
Out[622]: [[1, 1], [1, 1, 1, 1]]
In [624]: x3=threshold_along_an_axis(np.array(data1).T)
In [625]: timeit x3=threshold_along_an_axis(np.array(data1).T)
10 loops, best of 3: 214 ms per loop

Oops, it's slower, unless we move the array creation step out of the timing:

In [626]: arr=np.array(data1).T
In [627]: timeit x3=threshold_along_an_axis(arr)
10 loops, best of 3: 128 ms per loop

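This is an old, familiar story. List comprehensions often perform well for small lists, and also when creating the array adds significant overhead.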

I don't have 1.13 and the new features Eric mentions, but changing the list into an object array does let me use np.frompyfunc:

In [640]: dataO = np.empty(len(data1), object)
In [641]: dataO[:]=data1
In [642]: x5=np.frompyfunc(col_pred, 1,1)(dataO)
In [643]: timeit x5=np.frompyfunc(col_pred, 1,1)(dataO)
10 loops, best of 3: 197 ms per loop

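np.frompyfunc takes an array (or arrays), applies broadcasting, and evaluates a 'scalar' function, returning an object array. It is used by np.vectorize, and usually gives about a 2x speed improvement over direct iteration. Here, though, it doesn't help.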
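To see what np.frompyfunc does on the small example from the question, here is a minimal sketch. The object array is filled element by element so that each inner list stays a single object; the variable names are illustrative only.

```python
import numpy as np

col_pred = lambda x: [v for v in x if v > 0.5]
rows = [[0, 1, 1, 0], [1, 1, 1, 1]]

# Build a 1-D object array whose elements are the inner lists.
rows_obj = np.empty(len(rows), dtype=object)
for i, row in enumerate(rows):
    rows_obj[i] = row

# frompyfunc(func, nin, nout): one input, one output, object dtype result.
filter_ufunc = np.frompyfunc(col_pred, 1, 1)
filtered = filter_ufunc(rows_obj)
```

Each element of filtered is the list returned by col_pred for the corresponding row, so the lengths are free to differ.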

numpy apply_along_axis with different result sizes
