How to select elements from a numpy array based on a list?

Time: 2017-06-29 06:46:38

Tags: numpy

I have a dataframe like this:

array([[1374495, 3, 'prior', ..., 16.0, 'soy lactosefree', 'dairy eggs'],
       [3002854, 3, 'prior', ..., 16.0, 'soy lactosefree', 'dairy eggs'],
       [2710558, 3, 'prior', ..., 16.0, 'soy lactosefree', 'dairy eggs'],
       ...,
       [1355976, 206200, 'prior', ..., 16.0, 'soy lactosefree',
        'dairy eggs'],
       [1909878, 206200, 'prior', ..., 16.0, 'soy lactosefree',
        'dairy eggs'],
       [943915, 206200, 'train', ..., 16.0, 'soy lactosefree', 'dairy eggs']], dtype=object)

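The first number in each row is the orderid, e.g. 1374495, 3002854, 2710558... I also have a list of orderids that should be used to pick rows from the array. For example, if the list is [1355976, 1909878, 943915], I want to select the rows whose orderid is in [1355976, 1909878, 943915]. How can I do this efficiently?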

2 answers:

Answer 0: (score: 1)

Approach #1

Here's one approach based on np.searchsorted -

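import numpy as np

def filter_rows(a, idx):
    # a is the input dataframe as an array
    # idx is the list of order ids used to select rows

    a_idx = a[:,0]                              # id column of the array
    idx_arr = np.sort(idx)                      # sorted ids to look up
    pos_idx = np.searchsorted(idx_arr, a_idx)   # insertion position of each row id
    pos_idx[pos_idx == idx_arr.size] = 0        # keep positions in bounds
    mask = idx_arr[pos_idx] == a_idx            # True where the row id is in idx
    out = a[mask]
    return out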
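
To make the individual steps of the function above easier to follow, here is a small worked sketch on made-up values. The names `ids` and `wanted` and their contents are illustrative assumptions, not data from the question.

```python
import numpy as np

ids = np.array([50, 10, 40, 99, 20])   # hypothetical id column of the array
wanted = [20, 50, 70]                  # hypothetical ids to keep

wanted_sorted = np.sort(wanted)              # [20 50 70]
pos = np.searchsorted(wanted_sorted, ids)    # insertion position of each id
# An id larger than every wanted value gets position len(wanted_sorted),
# which is out of bounds. Pointing it at slot 0 is safe: such an id is
# larger than wanted_sorted[0], so the equality check below rejects it.
pos[pos == wanted_sorted.size] = 0
mask = wanted_sorted[pos] == ids             # True where the id is wanted

print(pos.tolist())    # [1, 0, 1, 0, 0]
print(mask.tolist())   # [True, False, False, False, True]
```

Only 50 and 20 survive the check. 99 was redirected to slot 0 and rejected there, and 70 appears in the list but not in the array, so it selects nothing.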

Approach #2

Here's another with np.in1d -

a[np.in1d(a[:,0], idx)]

Sample run -

In [83]: a
Out[83]: 
array([[1374495, 3, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [3002854, 3, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [2710558, 3, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [1355976, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [1909878, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [943915, 206200, 'train', 16.0, 'soy lactosefree', 'dairy eggs']])

In [84]: idx
Out[84]: [1355976, 1909878, 943915]

In [85]: filter_rows(a, idx)
Out[85]: 
array([[1355976, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [1909878, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [943915, 206200, 'train', 16.0, 'soy lactosefree', 'dairy eggs']])

In [88]: a[np.in1d(a[:,0], idx)]
Out[88]: 
array([[1355976, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [1909878, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
       [943915, 206200, 'train', 16.0, 'soy lactosefree', 'dairy eggs']])
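
As a side note for newer NumPy versions: `np.isin` (available since NumPy 1.13) is the recommended replacement for `np.in1d`, which is deprecated as of NumPy 2.0. The sketch below uses a shortened, made-up version of the array. Casting the id column to an integer dtype is an optional assumption that all ids are plain integers; it avoids comparing Python objects one by one.

```python
import numpy as np

# Shortened, illustrative version of the array from the question
a = np.array([[1374495, 3, 'prior'],
              [1355976, 206200, 'prior'],
              [943915, 206200, 'train']], dtype=object)
idx = [1355976, 943915]

# Cast the object-typed id column to integers, then test membership
mask = np.isin(a[:, 0].astype(np.int64), idx)
selected = a[mask]

print(mask.tolist())            # [False, True, True]
print(selected[:, 0].tolist())  # [1355976, 943915]
```

With either approach the selected rows come back in the order they appear in the array, not in the order of the list.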

Answer 1: (score: 0)

The numpy_indexed package (disclaimer: I am its author) contains efficient functionality for these kinds of operations:

import numpy_indexed as npi
row_idx = npi.indices(id_column, ids_to_get_index_of)

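This should have the same performance as the solution provided by Divakar, but comes with some extra bells and whistles, such as kwargs to choose between various ways of handling missing values, and so on.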
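
For readers who would rather not add a dependency, the idea of looking up the row index of each wanted id can be sketched in plain NumPy. This is not how numpy_indexed is implemented. `find_row_indices` is a hypothetical helper, and it assumes a non-empty id column with unique ids.

```python
import numpy as np

def find_row_indices(id_column, wanted_ids):
    """Hypothetical helper: for each wanted id, return the index of the row
    holding it. Assumes id_column is non-empty and its ids are unique.
    Raises KeyError if any wanted id is absent."""
    id_column = np.asarray(id_column)
    wanted_ids = np.asarray(wanted_ids)
    order = np.argsort(id_column, kind="stable")     # permutation that sorts the ids
    sorted_ids = id_column[order]
    pos = np.searchsorted(sorted_ids, wanted_ids)    # position in the sorted ids
    pos = np.minimum(pos, len(sorted_ids) - 1)       # keep positions in bounds
    found = sorted_ids[pos] == wanted_ids
    if not found.all():
        raise KeyError(f"ids not present: {wanted_ids[~found].tolist()}")
    return order[pos]                                # map back to original row numbers

id_column = np.array([1374495, 3002854, 2710558, 1355976, 1909878, 943915])
row_idx = find_row_indices(id_column, [1355976, 1909878, 943915])
print(row_idx.tolist())   # [3, 4, 5]
```

Unlike the boolean-mask approaches, this returns the rows in the order of the requested ids, so `a[row_idx]` follows the list rather than the array.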