Question

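I'm looking for the fastest way to drop a set of rows whose indices I already have from a large Pandas DataFrame, or equivalently to select the subset given by the difference of those indices (which yields the same dataset).

So far I have two solutions, both of which seem rather slow to me: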

df.loc[df.index.difference(indices)]

This takes about 115 seconds.

df.drop(indices)

This takes about 215 seconds.

Is there a faster way to do this? Preferably in Pandas.

Performance of the proposed solutions

~41 seconds: df[~df.index.isin(indices)] by @jezrael

Answer 1

I believe you can create a boolean mask, invert it with ~, and filter by boolean indexing:

df1 = df[~df.index.isin(indices)]

As @user3471881 mentioned, if you plan to manipulate the filtered df later, you need to add copy to avoid chained indexing:

df1 = df[~df.index.isin(indices)].copy()

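The cost of this filtering depends on the number of matching indices and on the length of the DataFrame.

So another possible solution is to build an array/list of the indices to keep, so that no inversion is needed: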

df1 = df[df.index.isin(need_indices)]
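
To make the two variants above concrete, here is a minimal sketch on a toy frame. The column name, the labels, and the contents of indices and need_indices are made up for illustration; only the isin-based filtering comes from the answer.

```python
import pandas as pd

# Hypothetical toy data; labels and values are invented for this example.
df = pd.DataFrame({"value": [10, 11, 12, 13, 14]}, index=list("abcde"))
indices = ["b", "d"]            # labels to drop
need_indices = ["a", "c", "e"]  # labels to keep

# Drop: build a membership mask on the index and invert it with ~.
dropped = df[~df.index.isin(indices)].copy()

# Keep: test membership directly, so no inversion is needed.
kept = df[df.index.isin(need_indices)].copy()

print(dropped.index.tolist())
print(dropped.equals(kept))
```

Both calls select the same rows here because need_indices is exactly the complement of indices; which one is cheaper to build depends on which list you already have.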

Answer 2

Use iloc (or loc, see below) together with Index.drop:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)

%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]

41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

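@jezrael pointed out that iloc can only be used if the index is a default RangeIndex; otherwise you have to use loc. Even so, this is still faster than df[~df.index.isin(indices)] (see below for why).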
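
The difference between iloc and loc matters as soon as the labels stop matching row positions. The following is a small sketch on an invented frame; the labels 10 to 50 and the np.flatnonzero variant are assumptions added for illustration, not part of the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame whose integer labels are NOT 0..n-1.
df = pd.DataFrame({"value": np.arange(5)}, index=[10, 20, 30, 40, 50])
indices = [20, 40]  # labels to drop

remaining = df.index.drop(indices)  # surviving labels: 10, 30, 50

# Label-based lookup works for any index type.
by_label = df.loc[remaining]

# Position-based lookup treats 10, 30, 50 as row positions, which are
# out of bounds for a 5-row frame.
try:
    df.iloc[remaining]
    iloc_ok = True
except IndexError:
    iloc_ok = False

# If positions are wanted anyway, they can be derived from a mask.
positions = np.flatnonzero(~df.index.isin(indices))
by_position = df.iloc[positions]

print(iloc_ok)
print(by_label.equals(by_position))
```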

All three options on 10 million rows:

df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)

%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]

4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

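Why does the super slow loc outperform boolean_indexing?

Well, the short answer is that it doesn't. df.index.drop(indices) is simply a lot faster than ~df.index.isin(indices) (given the data above with 10 million rows):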

%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)

4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

We can compare this with the performance of boolean_indexing vs. iloc vs. loc once the mask and the dropped index are precomputed:

boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)

%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]


489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
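
The timings above use IPython's %timeit. As a rough sketch of how the same comparison could be run as a plain script, the snippet below uses the standard timeit module and also checks that all three variants return the same rows. The row count is deliberately reduced and the candidates dictionary is a helper invented for this example; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

n = 100_000  # deliberately small; the answer above uses 10 million rows
df = pd.DataFrame(np.arange(n))
indices = np.arange(0, n, 3)

# Hypothetical helper mapping a label to each variant under test.
candidates = {
    "boolean mask": lambda: df[~df.index.isin(indices)],
    "iloc + Index.drop": lambda: df.iloc[df.index.drop(indices)],
    "loc + Index.drop": lambda: df.loc[df.index.drop(indices)],
}

reference = df.drop(indices)
for name, fn in candidates.items():
    # Every variant should select exactly the rows that df.drop keeps.
    assert fn().equals(reference), name
    seconds = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print(f"{name}: {seconds * 1e3:.2f} ms per call")
```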

Answer 3

If the order of the rows doesn't matter, you can rearrange them in place:

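import numpy as np
import pandas as pd
from numba import njit

n = 10**7
df = pd.DataFrame(np.arange(4 * n).reshape(n, 4))
# indices must be sorted and unique, which np.unique guarantees
indices = np.unique(np.random.randint(0, n, size=n // 2))

@njit
def _dropfew(values, indices):
    # Walk the indices from largest to smallest and overwrite each
    # dropped row with the current last row of the array.
    k = len(values) - 1
    for ind in indices[::-1]:
        values[ind] = values[k]
        k -= 1

def dropfew(df, indices):
    # Relies on df.values being a writable view of the underlying data
    # (single-dtype frame); this mutates df in place.
    _dropfew(df.values, indices)
    return df.iloc[:len(df) - len(indices)]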

Timings:

In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s

In [40]: %time dropfew(df,indices)
Wall time: 219 ms
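
To see what the in-place approach does to the data, here is the same loop on a tiny array in plain NumPy, without numba. The function name drop_rows_unordered and the 6-row input are made up for this illustration; it is a sketch of the idea, not a replacement for the compiled version.

```python
import numpy as np

def drop_rows_unordered(values, indices):
    """Overwrite each dropped row with a row from the tail, then truncate.

    `indices` must be sorted ascending and unique (as np.unique returns them).
    `values` is modified in place; the result is a view of its leading rows.
    """
    k = len(values) - 1
    for ind in indices[::-1]:    # largest index first
        values[ind] = values[k]  # fill the hole with the current last row
        k -= 1
    return values[: len(values) - len(indices)]

values = np.arange(12).reshape(6, 2)  # rows 0..5
indices = np.array([1, 4])
result = drop_rows_unordered(values.copy(), indices)

print(result)
# [[ 0  1]
#  [10 11]
#  [ 4  5]
#  [ 6  7]]
```

Row 5 has moved into the slot of row 1, so the surviving rows are the right ones but no longer in their original order. In the DataFrame version this also means the remaining index labels no longer identify the original rows.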

Fastest way to drop rows / get subset with difference from large DataFrame in Pandas
