Question

I have a data frame with about 19,000 rows and 3 columns (X, Y, Z), and I am trying to mask it so that I keep only the rows with X_min <= X < X_max, Y_min <= Y < Y_max, and Z_min <= Z < Z_max.

In this example,

df['X'] is 0.0, 0.1, 0.2, 0.3, ..., 5.0
df['Y'] is -3.0, -2.9, -2.8, ..., 3.0
df['Z'] is -2.0, -1.9, ..., -1.5

So the number of rows is 51 * 61 * 6 = 18666.
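For anyone who wants to reproduce the setup, here is a minimal sketch of a frame with the same shape. It assumes the three columns form a full grid (every combination of X, Y and Z), which is what the row count 51 * 61 * 6 suggests; the use of np.linspace and MultiIndex.from_product is just one convenient way to build it, not necessarily how the original data was created.

```python
import numpy as np
import pandas as pd

# Axis values as described in the question (assumed to be evenly spaced).
x = np.linspace(0.0, 5.0, 51)    # 0.0, 0.1, ..., 5.0
y = np.linspace(-3.0, 3.0, 61)   # -3.0, -2.9, ..., 3.0
z = np.linspace(-2.0, -1.5, 6)   # -2.0, -1.9, ..., -1.5

# Cartesian product of the three axes, one row per (X, Y, Z) combination.
grid = pd.MultiIndex.from_product([x, y, z], names=["X", "Y", "Z"])
df = grid.to_frame(index=False)

print(df.shape)   # rows x columns
print(df.dtypes)  # all three columns should be float64
```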

Creating a single mask condition takes about 1 second.

cond1 = df['X']>=X_min

I have the following 6 conditions, and creating all 6 of them takes about 3-3.5 seconds.

from time import time

start1 = time()
cond1 = df['X'] >= X_min
cond2 = df['X'] < X_max
cond3 = df['Y'] >= Y_min
cond4 = df['Y'] < Y_max
cond5 = df['Z'] >= Z_min
cond6 = df['Z'] < Z_max
finish1 = time()
print(finish1 - start1)  # this is about 3-3.5 sec

start2 = time()
df2 = df[cond1 & cond2 & cond3 & cond4 & cond5 & cond6]  # this does not take long
finish2 = time()
print(finish2 - start2)  # this is about 0.002 sec

By the way, the code below takes a similar amount of time (3-3.5 seconds).

df2 = df[(df['X']>=X_min)&(df['X']<X_max)&(df['Y']>=Y_min)&(df['Y']<Y_max)&(df['Z']>=Z_min)&(df['Z']<Z_max)]

How can I speed this up? Can I make it faster while still using a pandas data frame?

Answer 1

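You may want to run df.info() to double-check the data types of the columns. Comparisons on numeric columns should be much faster; if the columns hold strings, they will be a lot slower.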
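A minimal sketch of that check, assuming the columns ended up with the generic object dtype instead of float64. The small frame and the X_min value here are hypothetical stand-ins, and pd.to_numeric is one common way to convert; whether this is the actual cause of the slowdown in the question is not confirmed.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose columns are stored as generic Python objects.
df = pd.DataFrame({
    "X": np.linspace(0.0, 5.0, 51),
    "Y": np.linspace(-3.0, 3.0, 51),
    "Z": np.linspace(-2.0, -1.5, 51),
}).astype(object)

df.info()  # the Dtype column reports "object" for X, Y and Z

# Convert every column to a numeric dtype so comparisons run vectorized.
df_num = df.apply(pd.to_numeric)
print(df_num.dtypes)

X_min = 1.0  # hypothetical bound
mask_obj = df["X"] >= X_min      # element-by-element comparison on objects
mask_num = df_num["X"] >= X_min  # vectorized comparison on float64
```

Both masks select the same rows; only the storage type, and therefore the cost of the comparison, differs.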

Answer 2

pandas' .query tends to be faster than ordinary boolean indexing.
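A minimal sketch of the same filter written with DataFrame.query. The bound values are hypothetical, and the frame is rebuilt here only so the snippet runs on its own. Names prefixed with @ refer to local Python variables. Whether this is actually faster for a frame of roughly 19,000 rows is worth measuring rather than assuming.

```python
import numpy as np
import pandas as pd

# Rebuild a grid-shaped frame like the one in the question (assumed layout).
x = np.linspace(0.0, 5.0, 51)
y = np.linspace(-3.0, 3.0, 61)
z = np.linspace(-2.0, -1.5, 6)
df = pd.MultiIndex.from_product([x, y, z], names=["X", "Y", "Z"]).to_frame(index=False)

# Hypothetical bounds, chosen only for illustration.
X_min, X_max = 1.0, 2.0
Y_min, Y_max = -1.0, 1.0
Z_min, Z_max = -1.9, -1.6

# Same half-open ranges as the boolean-indexing version in the question.
df2 = df.query(
    "X >= @X_min and X < @X_max and "
    "Y >= @Y_min and Y < @Y_max and "
    "Z >= @Z_min and Z < @Z_max"
)

# Reference result using plain boolean indexing, for comparison.
expected = df[
    (df["X"] >= X_min) & (df["X"] < X_max)
    & (df["Y"] >= Y_min) & (df["Y"] < Y_max)
    & (df["Z"] >= Z_min) & (df["Z"] < Z_max)
]
print(len(df2), len(expected))
```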

Masking (filtering) a pandas data frame is too slow
