Filtering a pandas DataFrame by multiple columns

Time: 2015-06-25 18:44:33

Tags: python pandas

I have a pandas DataFrame, and I want to filter it on some function of several columns; the documentation seems to cover only single columns. I did the following, but I doubt it is the most efficient (or the most elegant) approach. The code drops the rows of the DataFrame dog where the difference between the timestamps in two of the columns is greater than a threshold value:

flog = zip(dog['date1'], dog['date2'])
cog = [(x[0]-x[1]).days for x in flog]
dog['diff'] = cog
ddog = dog[(dog['diff']<5)]

2 answers:

Answer 0 (score: 2)

You can write your customized filter function this way.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(100, 2), columns=['A', 'B'], index=np.random.choice(['X', 'Y', 'Z'], 100))

Out[257]:
         A       B
Y  -0.6444  0.9515
Y   0.0541  0.1810
X   1.0280 -2.1507
Y   0.5513 -0.6256
X  -1.4126  0.8487
Y  -0.4272 -0.7669
Z  -0.3358  0.8212
Z  -0.0328 -1.1885
Y   0.9210  1.7363
Z   1.2619 -2.5311
..     ...     ...
Y   0.4495 -0.1995
Y  -0.5025  0.8696
Z  -0.3178  0.5244
X   1.5752 -0.1915
Z   0.2572  0.1216
X  -0.5613  1.7869
Y  -0.4322  1.4184
Z   0.2402  0.9258
Z  -0.3328  1.7380
X  -1.9155  0.0929

[100 rows x 2 columns]

def my_filter(group):
    # say A^2 + B^2 > 1
    selector = (group.A ** 2 + group.B ** 2) > 1
    return group[selector]

df.groupby(level=0).apply(my_filter)

Out[256]:
           A       B
X X   1.0280 -2.1507
  X  -1.4126  0.8487
  X  -0.6299  0.8297
  X   0.8790 -0.5672
  X  -2.1781  1.8232
  X   0.4533 -1.1098
  X   0.8996 -0.6523
  X  -2.6023  0.2152
  X   1.5641 -1.0823
  X  -0.4553  1.0037
..       ...     ...
Z Z  -0.7860  1.3643
  Z   0.7350 -1.3309
  Z   0.9675 -0.9975
  Z  -1.0461 -0.8538
  Z  -0.9659  1.7430
  Z  -0.9788  0.3100
  Z   1.6457  1.7855
  Z  -2.0771  0.4892
  Z   0.0399 -1.6994
  Z  -0.3328  1.7380

[61 rows x 2 columns]

We've removed 39 rows (from 100 to 61).
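If the per-group step is not actually needed, the same condition can probably be applied to the whole frame at once with a boolean mask. The following is a minimal sketch under that assumption; the helper name outside_unit_circle and the fixed seed are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

# Illustrative data shaped like the answer's example; the seed is arbitrary.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 2)), columns=['A', 'B'],
                  index=rng.choice(['X', 'Y', 'Z'], 100))

def outside_unit_circle(frame):
    """Return a boolean Series that is True where A^2 + B^2 > 1."""
    return (frame['A'] ** 2 + frame['B'] ** 2) > 1

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[outside_unit_circle(df)]
```

This keeps the original flat index rather than the two-level index that groupby(...).apply(...) produces, which may or may not be what you want.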

Answer 1 (score: 2)

Select your columns from the DataFrame, then apply your function (possibly a lambda expression, depending on usage).

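# Access row values by column label; positional x[0] on a labeled row is deprecated in recent pandas.
mask = dog[['date1', 'date2']].apply(lambda x: abs(x['date1'] - x['date2']).days < 5, axis=1)
dog[mask]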

For example:

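import pandas as pd

df = pd.DataFrame({'date1': pd.date_range(start='2015-1-1', periods=10),
                   'date2': pd.date_range(start='2015-1-1', periods=10)[::-1]})
# Access row values by column label; positional x[0] on a labeled row is deprecated in recent pandas.
mask = df[['date1', 'date2']].apply(lambda x: abs(x['date1'] - x['date2']).days < 5, axis=1)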

>>> df
       date1      date2
0 2015-01-01 2015-01-10
1 2015-01-02 2015-01-09
2 2015-01-03 2015-01-08
3 2015-01-04 2015-01-07
4 2015-01-05 2015-01-06
5 2015-01-06 2015-01-05
6 2015-01-07 2015-01-04
7 2015-01-08 2015-01-03
8 2015-01-09 2015-01-02
9 2015-01-10 2015-01-01

>>> df[mask]
       date1      date2
3 2015-01-04 2015-01-07
4 2015-01-05 2015-01-06
5 2015-01-06 2015-01-05
6 2015-01-07 2015-01-04

With the DataFrame now filtered by date, you can proceed with your analysis.
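Since the question asks about efficiency, it may be worth noting that a row-wise apply can usually be avoided here: subtracting two datetime columns yields a timedelta column, and its day count is available through the .dt accessor. A minimal sketch, reusing the example data above; the names day_diff and filtered are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'date1': pd.date_range(start='2015-01-01', periods=10),
    'date2': pd.date_range(start='2015-01-01', periods=10)[::-1],
})

# Column-wise subtraction gives a timedelta Series; .dt.days extracts whole days.
day_diff = (df['date1'] - df['date2']).abs().dt.days

# Keep rows whose absolute difference is under 5 days (rows 3, 4, 5 and 6 here).
filtered = df[day_diff < 5]
```

This selects the same rows as the apply-based mask and does not add a helper column to the DataFrame.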