Python & NumPy - creating dynamic, arbitrary subsets of an ndarray

Date: 2014-10-03 13:14:01

Tags: python numpy itertools

I am looking for a general way to do this:

raw_data = np.array(somedata)   
filterColumn1 = raw_data[:,1]
filterColumn2 = raw_data[:,3]
cartesian_product = itertools.product(np.unique(filterColumn1), np.unique(filterColumn2))
for val1, val2 in cartesian_product:
    fixed_mask = (filterColumn1 == val1) & (filterColumn2 == val2)
    subset = raw_data[fixed_mask]

I would like to be able to use an arbitrary number of filterColumns. So what I want is:

filterColumns = [filterColumn1, filterColumn2, ...]
uniqueValues = map(np.unique, filterColumns)
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = ????
    subset = raw_data[variable_mask]

Is there a simple syntax that does what I want? If not, should I try a different approach?

Edit: this seems to work

cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:

    variable_mask = True
    for idx, fc in enumerate(filterColumns):
        variable_mask &= (fc == combination[idx])

    subset = raw_data[variable_mask]

2 answers:

Answer 0 (score: 2)

Something like this?

variable_mask = np.ones_like(filterColumns[0], dtype=bool)     # select all rows initially; must be boolean so it acts as a mask, not as integer indices
for column, val in zip(filterColumns, combination):
    variable_mask &= (column == val)
subset = raw_data[variable_mask]
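
For reference, the loop can be wrapped into a small self-contained generator. This is only a sketch: the name `iter_subsets` and the sample array are made up for illustration, and `np.logical_and.reduce` stands in for the explicit `&=` loop.

```python
import itertools
import numpy as np

def iter_subsets(raw_data, filter_indexes):
    """Yield (combination, subset) for every combination of unique column values."""
    filter_columns = [raw_data[:, i] for i in filter_indexes]
    unique_values = [np.unique(col) for col in filter_columns]
    for combination in itertools.product(*unique_values):
        # AND together one boolean comparison per filter column
        mask = np.logical_and.reduce(
            [col == val for col, val in zip(filter_columns, combination)]
        )
        yield combination, raw_data[mask]

# Hypothetical sample data: columns 1 and 3 are the filter columns
raw_data = np.array([[0, 1, 9, 2],
                     [1, 1, 8, 2],
                     [2, 1, 7, 3],
                     [3, 4, 6, 3]])
results = {combo: subset for combo, subset in iter_subsets(raw_data, [1, 3])}
```

Note that the Cartesian product also yields combinations that never occur in the data, so some subsets come back empty (here the combination `(4, 2)`).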

Answer 1 (score: 2)

You can use numpy.all and index broadcasting for this:

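filter_matrix = np.array(filterColumns)                 # shape (n_filters, n_rows)
combination_array = np.array(combination)               # shape (n_filters,)
bool_matrix = filter_matrix == combination_array[:, np.newaxis]   # one value per filter, broadcast along the rows
subset = raw_data[np.all(bool_matrix, axis=0)]          # keep the rows where every filter matches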
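
The same broadcasting idea can, in principle, be pushed one step further so that every combination is compared against every row in one shot. The following is a sketch under assumptions: the sample data is invented, and the intermediate boolean array has shape (n_rows, n_combos, n_filters), so it is only reasonable when that product stays small.

```python
import itertools
import numpy as np

# Hypothetical sample data: columns 1 and 3 are the filter columns
raw_data = np.array([[0, 1, 9, 2],
                     [1, 1, 8, 2],
                     [2, 1, 7, 3],
                     [3, 4, 6, 3]])
filter_indexes = [1, 3]

columns = raw_data[:, filter_indexes]                     # (n_rows, n_filters)
unique_values = [np.unique(columns[:, j]) for j in range(columns.shape[1])]
combos = np.array(list(itertools.product(*unique_values)))  # (n_combos, n_filters)

# Compare every row with every combination:
# (n_rows, 1, n_filters) == (1, n_combos, n_filters) -> (n_rows, n_combos, n_filters)
masks = (columns[:, np.newaxis, :] == combos[np.newaxis, :, :]).all(axis=2)

# Column j of `masks` is the row mask for combos[j]
subsets = [raw_data[masks[:, j]] for j in range(len(combos))]
```

Each row of the data matches exactly one combination, so every row of `masks` contains a single True.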

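If your filters live inside the matrix itself, there are simpler ways to do the same thing, notably with numpy argsort and numpy roll along an axis. First roll the axis until your filter columns come first, then sort on them and slice the array vertically to get the rest of the matrix.

In general, if a for loop can be avoided in Python, it is better to avoid it.

Update

Here is the complete code without a for loop: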

import numpy as np

# select filtering indexes
filter_indexes = [1, 3]
# generate the test data
raw_data = np.random.randint(0, 4, size=(50,5))


# create a column that we would use for indexing
index_columns = raw_data[:, filter_indexes]

# sort the index columns in lexicographic order over all the indexing columns
# (np.lexsort treats the last key as the primary one; any consistent order works for grouping)
argsorts = np.lexsort(index_columns.T)

# sort both the index and the data column
sorted_index = index_columns[argsorts, :]
sorted_data = raw_data[argsorts, :]

# in each indexing column, find if number in row and row-1 are identical
# then group to check if all numbers in corresponding positions in row and row-1 are identical
autocorrelation = np.all(sorted_index[1:, :] == sorted_index[:-1, :], axis=1)

# find out the breakpoints: these are the positions where row and row-1 are not identical
breakpoints = np.nonzero(np.logical_not(autocorrelation))[0]+1

# finally find the desired subsets 
subsets = np.split(sorted_data, breakpoints)
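
The split gives the subsets but not the key each one belongs to. One possible way to recover the keys, sketched here with invented names (`starts`, `keys`, `groups`) and a seeded generator so the run is reproducible, is to read the first row of each group from the sorted index:

```python
import numpy as np

rng = np.random.default_rng(0)                 # seeded only for reproducibility
raw_data = rng.integers(0, 4, size=(50, 5))
filter_indexes = [1, 3]

index_columns = raw_data[:, filter_indexes]
order = np.lexsort(index_columns.T)
sorted_index = index_columns[order]
sorted_data = raw_data[order]

# A new group starts wherever a row differs from the previous one
changed = np.any(sorted_index[1:] != sorted_index[:-1], axis=1)
breakpoints = np.nonzero(changed)[0] + 1
subsets = np.split(sorted_data, breakpoints)

# The first row of each group carries that group's key
starts = np.concatenate(([0], breakpoints))
keys = sorted_index[starts]
groups = {tuple(key): subset for key, subset in zip(keys.tolist(), subsets)}
```

Unlike the itertools.product loop, this only produces the combinations that actually occur in the data, so there are no empty subsets.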

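Another implementation would be to convert the index matrix to a matrix of strings, sum (concatenate) them row by row, take an argsort on the now single index column, and split as above.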
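
A rough sketch of that string-key variant follows. The `"|"` separator is an assumption added here so that, for example, "1" + "12" and "11" + "2" do not collapse into the same key; the sample data is invented. The remaining loop runs over the filter columns, not over the rows.

```python
import numpy as np

# Hypothetical sample data: columns 1 and 3 are the filter columns
raw_data = np.array([[0, 1, 9, 2],
                     [1, 1, 8, 2],
                     [2, 1, 7, 3],
                     [3, 4, 6, 3]])
filter_indexes = [1, 3]

# Build one string key per row, e.g. "1|2"
str_cols = raw_data[:, filter_indexes].astype(str)
keys = str_cols[:, 0]
for j in range(1, str_cols.shape[1]):
    keys = np.char.add(np.char.add(keys, "|"), str_cols[:, j])

# Sort on the single key column and split where the key changes
order = np.argsort(keys, kind="stable")
sorted_keys = keys[order]
breakpoints = np.nonzero(sorted_keys[1:] != sorted_keys[:-1])[0] + 1
subsets = np.split(raw_data[order], breakpoints)
```

The string sort is lexical rather than numeric ("10" sorts before "2"), which changes the order of the groups but not their contents.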

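For convenience, it may be preferable to first roll the index columns until they all sit at the beginning of the matrix, so that the sorting done above is easier to follow.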