Question

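This question is about filtering a NumPy ndarray based on the values in certain columns.

I have a fairly large NumPy ndarray (300000, 50) that I filter based on the values in a few specific columns. The array has a structured dtype, so I can access each column by name.

The first column is named category_code, and I need to filter the matrix so that it returns only the rows where category_code is in ("A", "B", "C").

The result must be another NumPy ndarray whose columns are still accessible by their dtype names.

Here is what I do now: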

index = numpy.asarray([row['category_code'] in ('A', 'B', 'C') for row in data])
filtered_data = data[index]

A list comprehension such as:

list = [row for row in data if row['category_code'] in ('A', 'B', 'C')]
filtered_data = numpy.asarray(list)

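does not work, because my original dtypes are no longer accessible.

Is there a better, more Pythonic way to achieve the same result?

Something that might look like: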

filtered_data = data.where({'category_code': ('A', 'B', 'C')})

Thanks!

Answer 1

You can use Pandas, a library built on NumPy, which offers a more generally useful implementation of ndarrays:

>>> # import the library
>>> import pandas as PD

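Create some sample data as a Python dictionary whose keys are the column names and whose values are the column values as Python lists, with one key/value pair per column: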

>>> data = {'category_code': ['D', 'A', 'B', 'C', 'D', 'A', 'C', 'A'], 
            'value':[4, 2, 6, 3, 8, 4, 3, 9]}

>>> # convert to a Pandas 'DataFrame'
>>> D = PD.DataFrame(data)

Returning only the rows where category_code is B or C takes two steps conceptually, but they can easily be done in one line:

>>> # step 1: create the index 
>>> idx = (D.category_code== 'B') | (D.category_code == 'C')

>>> # then filter the data against that index:
>>> D.loc[idx]

        category_code  value
   2             B      6
   3             C      3
   6             C      3

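Note the difference in indexing between Pandas and NumPy (the library Pandas is built on). In NumPy, you place the index directly inside the brackets, use "," to indicate which dimension you are indexing, and use ":" to indicate that you want all of the values (columns) in the other dimension. If D were a plain 2-D NumPy array rather than a DataFrame, that would look like:

>>> D[idx, :]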
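
To make the NumPy side of that comparison concrete, here is a minimal sketch of boolean-mask indexing on a plain 2-D array. The array contents and the names `arr`, `idx`, and `rows` are made up for illustration and are not part of the answer above.

```python
import numpy as np

# Hypothetical 2-D array: column 0 holds a value, column 1 holds a flag.
arr = np.array([[4, 1],
                [2, 0],
                [6, 1],
                [3, 0]])

# Boolean index with one entry per row.
idx = arr[:, 0] > 2

# Select the matching rows and keep every column.
rows = arr[idx, :]
print(rows)
```

The `:` in the second position is optional here; `arr[idx]` selects the same rows.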

In Pandas, you use the data frame's loc indexer and put only the index inside the brackets:

>>> D.loc[idx]
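
Since the question asks for category_code in ("A", "B", "C") and for a result that is still a NumPy array with named columns, the approach above might be extended roughly as follows. This is only a sketch: it assumes the original data is a structured array with the two fields shown (the field types and the names `frame` and `filtered_records` are illustrative), and it uses `Series.isin` in place of chained `==` comparisons.

```python
import numpy as np
import pandas as pd

# Assumed stand-in for the asker's structured array (two fields only).
data = np.array(
    [("D", 4), ("A", 2), ("B", 6), ("C", 3), ("D", 8), ("A", 4), ("C", 3), ("A", 9)],
    dtype=[("category_code", "U1"), ("value", "i8")],
)

# A structured array can be passed straight to the DataFrame constructor;
# the field names become the column names.
frame = pd.DataFrame(data)

# isin builds the boolean index for any number of allowed values.
mask = frame["category_code"].isin(["A", "B", "C"])
filtered = frame.loc[mask]

# Convert back to a NumPy record array whose fields keep the column names.
filtered_records = filtered.to_records(index=False)
print(filtered_records.dtype.names)
print(filtered_records["value"])
```

Note that the field dtypes after the round trip may differ from the original ones (string columns typically come back as object fields), so check them if the exact dtype matters.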

Answer 2

If you have the choice, I strongly recommend pandas: it has "column indexing" built in, along with many other features. It is built on numpy.
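
If pandas is not an option, the same filter can probably be expressed in plain NumPy with a vectorized membership test instead of a Python-level loop. The sketch below assumes a structured array like the one described in the question; the sample rows and field types are invented for illustration. It uses `numpy.isin`, which is available in NumPy 1.13 and later.

```python
import numpy as np

# Assumed stand-in for the asker's structured array (two fields only).
data = np.array(
    [("D", 4), ("A", 2), ("B", 6), ("C", 3), ("D", 8), ("A", 4), ("C", 3), ("A", 9)],
    dtype=[("category_code", "U1"), ("value", "i8")],
)

# Vectorized membership test on the named field; yields one boolean per row.
mask = np.isin(data["category_code"], ["A", "B", "C"])

# Boolean indexing returns a new structured array with the same dtype,
# so the columns stay accessible by name.
filtered_data = data[mask]
print(filtered_data.dtype.names)
print(filtered_data["value"])
```

Because the result keeps the original dtype, `filtered_data['category_code']` and the other named fields work exactly as they do on `data`.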
