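How to index a DataFrame by a column value between two limits

Time: 2020-09-29 13:57:52

Tags: python pandas dataframe indexing

I want to build smaller subsets of a DataFrame by selecting the rows whose value in one column falls within a desired range.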

Here is what I have worked out so far:

import pandas as pd 
import numpy as np

data = [['Alex',15,4],['Bob',5,1],['Clarke',13,2],['dan',6,2],['eve',19,1],['fin',12,1],['ginny',11,2],['hal',14,1],['ian',13,3],['jen',9,1] ]
df = pd.DataFrame(data,columns=['Name','Age','Pets'])
print (df)

lo = 10
hi = 14
lo_range = df[df['Age']>=lo]
print('lo_range:', lo_range)
# This line raises ValueError (the truth value of a Series is ambiguous):
mid_range = df[hi>= df['Age']>=lo]
print('mid_range:', mid_range)
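
The last indexing line fails because Python expands `hi >= df['Age'] >= lo` into `(hi >= df['Age']) and (df['Age'] >= lo)`, and `and` needs a single True/False from a whole Series, which pandas refuses with a ValueError. A minimal sketch of two element-wise alternatives, reusing the data above and assuming both bounds should be inclusive (the names `mask_and` and `mask_between` are only illustrative):

```python
import pandas as pd

data = [['Alex', 15, 4], ['Bob', 5, 1], ['Clarke', 13, 2], ['dan', 6, 2],
        ['eve', 19, 1], ['fin', 12, 1], ['ginny', 11, 2], ['hal', 14, 1],
        ['ian', 13, 3], ['jen', 9, 1]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Pets'])
lo, hi = 10, 14

# The chained comparison needs one bool for the whole Series, so it raises.
try:
    df[hi >= df['Age'] >= lo]
    chained_error = None
except ValueError as exc:
    chained_error = exc

# Element-wise AND of two boolean Series; the parentheses are required
# because & binds more tightly than >= and <=.
mask_and = (df['Age'] >= lo) & (df['Age'] <= hi)

# Series.between is inclusive on both ends by default.
mask_between = df['Age'].between(lo, hi)

mid_range = df[mask_and]
print('mid_range:', mid_range)
```

Both masks select the same rows here; `between` is shorter, while the `&` form makes it easy to switch one bound to a strict comparison.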

2 answers:

Answer 0 (score: 1)

Another approach is to use a lambda with apply:

mid_range = df[df['Age'].apply(lambda x: x in range(lo,hi+1))]
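
One caveat with this form: `x in range(lo, hi + 1)` is a membership test against the whole numbers lo through hi, so a non-integer value that lies inside the bounds is not matched. A small sketch with a hypothetical float-valued Series (not part of the question's data), next to an explicit bounds check inside the lambda:

```python
import pandas as pd

lo, hi = 10, 14
# Hypothetical ages with fractional values, for illustration only.
ages = pd.Series([9.5, 10.0, 12.5, 14.0, 14.5])

# Membership in range(): only whole numbers from lo to hi match.
in_range = ages.apply(lambda x: x in range(lo, hi + 1))

# Explicit bounds check: matches every value in [lo, hi].
in_bounds = ages.apply(lambda x: lo <= x <= hi)

print(in_range.tolist())   # [False, True, False, True, False]
print(in_bounds.tolist())  # [False, True, True, True, False]
```

For an integer column such as Age in this question the two forms agree.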

When I timed apply with a lambda against the & operator, I noticed that apply with the lambda was faster on this small data frame!

import time
start_time = time.time()
mid_range = df[df['Age'].apply(lambda x: x in range(lo,hi+1))]
end_time = time.time()
print('mid_range:', mid_range)
print('execution time(sec):', end_time - start_time)

mid_range:      Name  Age  Pets
2  Clarke   13     2
5     fin   12     1
6   ginny   11     2
7     hal   14     1
8     ian   13     3

execution time(sec): 0.0006139278411865234


start_time = time.time()
mid_range =  df[( df['Age']>=lo) & (df['Age']<=hi)]
end_time = time.time()
print('mid_range:', mid_range)
print('execution time(sec):', end_time - start_time)

mid_range:      Name  Age  Pets
2  Clarke   13     2
5     fin   12     1
6   ginny   11     2
7     hal   14     1
8     ian   13     3

execution time(sec): 0.0015518665313720703

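With only a few rows in the data frame the difference hardly matters. Be careful about extrapolating from this single measurement to tables with millions of rows, though: apply calls the lambda once per row in Python, while the & version is vectorized, so the comparison usually reverses as the frame grows.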
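
To check this on data of a more realistic size, timeit gives steadier numbers than a single time.time() pair. A rough sketch, assuming a hypothetical 100,000-row frame of random integer ages (the frame `big` and the helper names are made up for illustration, and the absolute timings will vary by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

lo, hi = 10, 14
# Hypothetical larger frame: 100,000 random integer ages in [0, 30).
rng = np.random.default_rng(0)
big = pd.DataFrame({'Age': rng.integers(0, 30, size=100_000)})

def with_apply():
    # One Python-level lambda call per row.
    return big[big['Age'].apply(lambda x: x in range(lo, hi + 1))]

def with_mask():
    # Two vectorized comparisons combined element-wise.
    return big[(big['Age'] >= lo) & (big['Age'] <= hi)]

# Both approaches should select exactly the same rows.
same_rows = with_apply().equals(with_mask())

# Best of 3 single runs for each approach, in seconds.
t_apply = min(timeit.repeat(with_apply, number=1, repeat=3))
t_mask = min(timeit.repeat(with_mask, number=1, repeat=3))
print('same rows:', same_rows)
print('apply:', t_apply, 'mask:', t_mask)
```

Taking the minimum over several repeats reduces the influence of one-off noise such as caching or other processes.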

Answer 1 (score: 1)

You can use .drop() with a boolean mask:

import pandas as pd 

data = [['Alex',15,4],['Bob',5,1],['Clarke',13,2],['dan',6,2],['eve',19,1],['fin',12,1],['ginny',11,2],['hal',14,1],['ian',13,3],['jen',9,1] ]
df = pd.DataFrame(data,columns=['Name','Age','Pets'])
print (df)

lo = 10
hi = 14

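# Drop rows with Age >= lo, keeping Age < lo.
lo_range = df.drop(df[(df["Age"] >= lo)].index)
print('lo_range:', lo_range)

# Drop rows with Age >= hi or Age < lo, keeping lo <= Age < hi (hi itself is excluded).
mid_range = df.drop(df[(df["Age"] >= hi) | (df["Age"] < lo)].index)
print('mid_range:', mid_range)

# Drop rows with Age < hi, keeping Age >= hi.
high_range = df.drop(df[(df["Age"] < hi)].index)
print('high_range:', high_range)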

Printing mid_range gives:

print('mid_range:', mid_range)
mid_range:      Name  Age  Pets
2  Clarke   13     2
5     fin   12     1
6   ginny   11     2
8     ian   13     3
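
Note that this mid_range keeps lo <= Age < hi, so hal (Age 14) is left out. If the upper bound should be included, as in the question's hi >= Age >= lo, one possible adjustment is to drop only the rows strictly above hi. A sketch on the same data (the names `outside`, `mid_inclusive` and `mid_by_mask` are illustrative):

```python
import pandas as pd

data = [['Alex', 15, 4], ['Bob', 5, 1], ['Clarke', 13, 2], ['dan', 6, 2],
        ['eve', 19, 1], ['fin', 12, 1], ['ginny', 11, 2], ['hal', 14, 1],
        ['ian', 13, 3], ['jen', 9, 1]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Pets'])
lo, hi = 10, 14

# Index labels of rows to discard: strictly above hi or strictly below lo.
outside = df[(df['Age'] > hi) | (df['Age'] < lo)].index
mid_inclusive = df.drop(outside)

# Same selection without drop(): negate the "outside" mask.
mid_by_mask = df[~((df['Age'] > hi) | (df['Age'] < lo))]

print('mid_inclusive:', mid_inclusive)
```

Dropping by index label relies on the index being unique; with duplicate labels, the negated mask is the safer of the two forms.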

Edit: the solution given above (mid_range = df[( df['Age']>=lo) & (df['Age']>=hi)]) returns

mid_range:    Name  Age  Pets
0  Alex   15     4
4   eve   19     1
7   hal   14     1

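I guess this is the high range (the second condition tests >= hi rather than <= hi), so it does not give the intended result.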
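
With two separately written masks it is easy to flip one comparison, as in the expression quoted in the edit. One alternative that keeps both bounds in a single expression is DataFrame.query, which accepts chained comparisons and can refer to local Python variables with an @ prefix. A minimal sketch on the same data, assuming inclusive bounds:

```python
import pandas as pd

data = [['Alex', 15, 4], ['Bob', 5, 1], ['Clarke', 13, 2], ['dan', 6, 2],
        ['eve', 19, 1], ['fin', 12, 1], ['ginny', 11, 2], ['hal', 14, 1],
        ['ian', 13, 3], ['jen', 9, 1]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Pets'])
lo, hi = 10, 14

# Inside query(), the chained comparison is evaluated element-wise,
# and @lo / @hi refer to the Python variables lo and hi.
mid_range = df.query('@lo <= Age <= @hi')
print('mid_range:', mid_range)
```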