I have a dataframe like this (shown as an array):
array([[1374495, 3, 'prior', ..., 16.0, 'soy lactosefree', 'dairy eggs'],
[3002854, 3, 'prior', ..., 16.0, 'soy lactosefree', 'dairy eggs'],
[2710558, 3, 'prior', ..., 16.0, 'soy lactosefree', 'dairy eggs'],
...,
[1355976, 206200, 'prior', ..., 16.0, 'soy lactosefree',
'dairy eggs'],
[1909878, 206200, 'prior', ..., 16.0, 'soy lactosefree',
'dairy eggs'],
[943915, 206200, 'train', ..., 16.0, 'soy lactosefree', 'dairy eggs']], dtype=object)
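The first number in each row is the orderid, e.g. 1374495, 3002854, 2710558...
Now I have a list of orderids that should be used to pick rows out of the array. For example, given the list [1355976, 1909878, 943915], I want to select the rows whose orderid is in [1355976, 1909878, 943915]. How can I do this efficiently?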
Answer 0: (score: 1)
Approach #1
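Here's one approach based on np.searchsorted -
    import numpy as np

    def filter_rows(a, idx):
        # a is the input dataframe as an array
        # idx is the list of orderids used to select rows
        a_idx = a[:, 0]
        idx_arr = np.sort(idx)
        # Insertion position of each orderid in the sorted lookup list
        pos_idx = np.searchsorted(idx_arr, a_idx)
        # Positions equal to idx_arr.size are out of bounds; point them at 0
        pos_idx[pos_idx == idx_arr.size] = 0
        # Keep a row only if the value at its position matches exactly
        mask = idx_arr[pos_idx] == a_idx
        out = a[mask]
        return out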
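To show why the out-of-bounds step in the function above is needed, here is a small trace on plain integer arrays. The values are made up for illustration and are not taken from the question's data. np.searchsorted returns an insertion position, which equals the length of the sorted list for any value larger than every wanted ID, so that position has to be redirected to a valid slot before the equality check.

```python
import numpy as np

# Hypothetical order IDs, chosen only to illustrate the steps.
a_idx = np.array([5, 40, 10, 99, 20])   # stands in for the first column
idx_arr = np.sort([20, 5, 40])          # wanted IDs, sorted: [5, 20, 40]

# Insertion position of each ID within the sorted wanted list.
pos_idx = np.searchsorted(idx_arr, a_idx)

# 99 exceeds every wanted ID, so its position equals idx_arr.size,
# which would be out of bounds; point it at slot 0 instead.
pos_idx[pos_idx == idx_arr.size] = 0

# An ID is wanted only if the value found at its position matches exactly.
mask = idx_arr[pos_idx] == a_idx

print(pos_idx.tolist())  # [0, 2, 1, 0, 1]
print(mask.tolist())     # [True, True, False, False, True]
```

Note that 10 and 99 both land on a valid position but fail the equality check, which is what turns an insertion position into a membership test.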
Approach #2
Here's another with np.in1d -
a[np.in1d(a[:,0], idx)]
Sample run -
In [83]: a
Out[83]:
array([[1374495, 3, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
[3002854, 3, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
[2710558, 3, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
[1355976, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
[1909878, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
[943915, 206200, 'train', 16.0, 'soy lactosefree', 'dairy eggs']])
In [84]: idx
Out[84]: [1355976, 1909878, 943915]
In [85]: filter_rows(a, idx)
Out[85]:
array([[1355976, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
[1909878, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
[943915, 206200, 'train', 16.0, 'soy lactosefree', 'dairy eggs']])
In [88]: a[np.in1d(a[:,0], idx)]
Out[88]:
array([[1355976, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
[1909878, 206200, 'prior', 16.0, 'soy lactosefree', 'dairy eggs'],
[943915, 206200, 'train', 16.0, 'soy lactosefree', 'dairy eggs']])
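On recent NumPy versions, np.isin is the recommended replacement for np.in1d, which is deprecated as of NumPy 2.0. A minimal sketch of Approach #2 written that way follows. It uses a trimmed, three-column copy of the sample array, and the cast of the ID column to int64 is an assumption that every orderid is an integer.

```python
import numpy as np

# Trimmed copy of the sample array (object dtype, only three columns kept).
a = np.array([
    [1374495, 3, 'prior'],
    [1355976, 206200, 'prior'],
    [1909878, 206200, 'prior'],
    [943915, 206200, 'train'],
], dtype=object)
idx = [1355976, 1909878, 943915]

# Assumption: all order IDs are integers, so the object column can be cast
# to int64 before the membership test.
order_ids = a[:, 0].astype(np.int64)

# Boolean mask: True where the row's order ID appears in idx.
mask = np.isin(order_ids, idx)
out = a[mask]

print(out[:, 0].tolist())  # [1355976, 1909878, 943915]
```

The selected rows come back in the order they appear in a, not in the order of idx.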
Answer 1: (score: 0)
The numpy_indexed package (disclaimer: I am its author) contains efficient functionality for these kinds of operations:
import numpy_indexed as npi
row_idx = npi.indices(id_column, ids_to_get_index_of)
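This should have the same performance as the solution provided by Divakar, but it comes with some extra bells and whistles, such as kwargs for choosing among various ways of handling missing values.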