I'm looking for a general way to do the following:
import itertools
import numpy as np

raw_data = np.array(somedata)
filterColumn1 = raw_data[:,1]
filterColumn2 = raw_data[:,3]
cartesian_product = itertools.product(np.unique(filterColumn1), np.unique(filterColumn2))
for val1, val2 in cartesian_product:
    fixed_mask = (filterColumn1 == val1) & (filterColumn2 == val2)
    subset = raw_data[fixed_mask]
I'd like to be able to use an arbitrary number of filterColumns. So what I want is:
filterColumns = [filterColumn1, filterColumn2, ...]
uniqueValues = map(np.unique, filterColumns)
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = ????
    subset = raw_data[variable_mask]
Is there a simple syntax for what I want? If not, should I try a different approach?
Edit: this seems to work
cartesian_product = itertools.product(*uniqueValues)
for combination in cartesian_product:
    variable_mask = True
    for idx, fc in enumerate(filterColumns):
        variable_mask &= (fc == combination[idx])
    subset = raw_data[variable_mask]
Answer 0 (score: 2)
Something like this?
# boolean dtype, so that indexing below acts as a mask rather than as integer indices
variable_mask = np.ones_like(filterColumns[0], dtype=bool)  # select all rows initially
for column, val in zip(filterColumns, combination):
    variable_mask &= (column == val)
subset = raw_data[variable_mask]
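For reference, here is a minimal self-contained sketch of the same idea wrapped in a generator. The name `iter_subsets` and the sample data are made up for illustration, and `np.logical_and.reduce` is swapped in for the explicit `&=` loop.

```python
import itertools
import numpy as np

def iter_subsets(raw_data, filter_indexes):
    """Yield (combination, subset) for every combination of unique values."""
    filter_columns = [raw_data[:, i] for i in filter_indexes]
    unique_values = [np.unique(col) for col in filter_columns]
    for combination in itertools.product(*unique_values):
        # AND together one boolean comparison per filter column
        mask = np.logical_and.reduce(
            [col == val for col, val in zip(filter_columns, combination)]
        )
        yield combination, raw_data[mask]

# illustrative data: filter on columns 1 and 3
raw_data = np.array([[0, 1, 9, 5],
                     [1, 1, 8, 6],
                     [2, 2, 7, 5],
                     [3, 1, 6, 5]])
subsets = dict(iter_subsets(raw_data, [1, 3]))
```

Note that combinations which never occur in the data, such as (2, 6) here, still appear in the result with an empty subset.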
Answer 1 (score: 2)
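You can use numpy.all and index broadcasting for this:
filter_matrix = np.array(filterColumns)  # shape (n_filters, n_rows)
combination_array = np.array(combination)  # shape (n_filters,)
# compare each filter column with its target value
bool_matrix = filter_matrix == combination_array[:, np.newaxis]
# keep only the rows where every filter column matches
subset = raw_data[np.all(bool_matrix, axis=0)]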
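A small runnable sketch of the broadcasting idea, assuming the filter columns are taken as an (n_rows, n_filters) slice of raw_data. The sample data and the chosen combination are illustrative only.

```python
import numpy as np

# illustrative data: filter on columns 1 and 3
raw_data = np.array([[0, 1, 9, 5],
                     [1, 1, 8, 6],
                     [2, 2, 7, 5],
                     [3, 1, 6, 5]])
filter_indexes = [1, 3]
combination = (1, 5)

# shape (n_rows, n_filters): one column per filter
filter_matrix = raw_data[:, filter_indexes]
# shape (1, n_filters), broadcast against every row
bool_matrix = filter_matrix == np.array(combination)[np.newaxis, :]
# a row is selected only if all of its filter columns match
mask = np.all(bool_matrix, axis=1)
subset = raw_data[mask]
```

With this layout the new axis goes first and the reduction runs over axis 1; if the filter columns are stacked as rows instead, both are flipped.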
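If your filters are inside the matrix, there is a simpler way to do the same thing, namely with numpy argsort and numpy roll along an axis. First, roll the axis until your filter columns come first, then sort on them and slice the array vertically to get the rest of the matrix.
As a general rule, if you can avoid a for loop in Python, you are better off avoiding it, since vectorized NumPy operations are usually faster than Python-level loops.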
Update
Here is the complete code without a for loop:
import numpy as np
# select filtering indexes
filter_indexes = [1, 3]
# generate the test data
raw_data = np.random.randint(0, 4, size=(50,5))
# create a column that we would use for indexing
index_columns = raw_data[:, filter_indexes]
# sort the index columns in lexicographic order over all the indexing columns
argsorts = np.lexsort(index_columns.T)
# sort both the index and the data column
sorted_index = index_columns[argsorts, :]
sorted_data = raw_data[argsorts, :]
# in each indexing column, find if number in row and row-1 are identical
# then group to check if all numbers in corresponding positions in row and row-1 are identical
autocorrelation = np.all(sorted_index[1:, :] == sorted_index[:-1, :], axis=1)
# find out the breakpoints: these are the positions where row and row-1 are not identical
breakpoints = np.nonzero(np.logical_not(autocorrelation))[0]+1
# finally find the desired subsets
subsets = np.split(sorted_data, breakpoints)
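The code above returns the subsets but not the value combination each one belongs to. A possible sketch that also returns the keys, assuming raw_data has at least one row; the function name `split_by_columns` and the sample data are made up for illustration.

```python
import numpy as np

def split_by_columns(raw_data, filter_indexes):
    """Return (keys, subsets): one key row and one subset per group present."""
    index_columns = raw_data[:, filter_indexes]
    order = np.lexsort(index_columns.T)
    sorted_index = index_columns[order]
    sorted_data = raw_data[order]
    # True where a row's key differs from the previous row's key
    changed = np.any(sorted_index[1:] != sorted_index[:-1], axis=1)
    breakpoints = np.nonzero(changed)[0] + 1
    # the first row of every group carries that group's key
    keys = sorted_index[np.concatenate(([0], breakpoints))]
    return keys, np.split(sorted_data, breakpoints)

# illustrative data: group by columns 1 and 3
raw_data = np.array([[0, 1, 9, 5],
                     [1, 1, 8, 6],
                     [2, 2, 7, 5],
                     [3, 1, 6, 5]])
keys, subsets = split_by_columns(raw_data, [1, 3])
```

Unlike the itertools.product loop, this only yields combinations that actually occur in the data, so there are no empty subsets.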
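Another way to implement this is to convert the index matrix to a matrix of strings, sum it row-wise (which concatenates the strings), argsort on the resulting single index column, and split as above.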
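A rough sketch of the string-key variant, under the assumption that joining each row with a separator is an acceptable stand-in for the row-wise string sum. It builds the keys with a Python-level comprehension, so it is not loop-free, and the sample data is illustrative.

```python
import numpy as np

# illustrative data: group by columns 1 and 3
raw_data = np.array([[0, 1, 9, 5],
                     [1, 1, 8, 6],
                     [2, 2, 7, 5],
                     [3, 1, 6, 5]])
filter_indexes = [1, 3]
index_columns = raw_data[:, filter_indexes]

# one string key per row, e.g. "1,5"; the separator keeps (1, 23) and (12, 3) apart
keys = np.array([",".join(map(str, row)) for row in index_columns])
# a stable sort preserves the original row order inside each group
order = np.argsort(keys, kind="stable")
sorted_keys = keys[order]
sorted_data = raw_data[order]
# positions where the key changes mark the start of a new group
breakpoints = np.nonzero(sorted_keys[1:] != sorted_keys[:-1])[0] + 1
subsets = np.split(sorted_data, breakpoints)
```

The groups come out in string order rather than numeric order, which only matters if the order of the subsets is significant.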
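For convenience, it may be more practical to first roll the index matrix until the index columns are all at the start of the matrix, so that the sorting above is easier to follow.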
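As an illustration of moving the index columns to the front: np.roll only shifts columns cyclically, so it works when the filter columns are adjacent. For non-adjacent columns such as 1 and 3, the sketch below swaps in an explicit column reordering instead. The data and variable names are illustrative.

```python
import numpy as np

raw_data = np.arange(20).reshape(4, 5)

# adjacent filter columns 1 and 2: shift every column one place to the left
rolled = np.roll(raw_data, -1, axis=1)

# non-adjacent filter columns 1 and 3: build an explicit column order instead
filter_indexes = [1, 3]
rest = [i for i in range(raw_data.shape[1]) if i not in filter_indexes]
reordered = raw_data[:, filter_indexes + rest]
```

After either step the filter columns occupy the leading positions, so they can be sliced off with `[:, :n_filters]`.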