I want to build smaller subsets of a DataFrame by selecting the rows whose value in one column falls within a desired range.
This is what I have worked out so far:
import pandas as pd
import numpy as np
data = [['Alex',15,4],['Bob',5,1],['Clarke',13,2],['dan',6,2],['eve',19,1],['fin',12,1],['ginny',11,2],['hal',14,1],['ian',13,3],['jen',9,1] ]
df = pd.DataFrame(data,columns=['Name','Age','Pets'])
print (df)
lo = 10
hi = 14
lo_range = df[df['Age']>=lo]
print('lo_range:', lo_range)
# Fails: a chained comparison on a Series raises ValueError (ambiguous truth value)
mid_range = df[hi>= df['Age']>=lo]
print('mid_range:', mid_range)
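For context, here is a minimal sketch of why the last selection above fails. Python evaluates hi >= s >= lo as (hi >= s) and (s >= lo), and the "and" needs a single truth value from a whole Series, which pandas refuses to provide. The four-row frame and the chained_ok flag are assumptions made for illustration; they are not part of the original post.

```python
import pandas as pd

# Shortened, illustrative copy of the frame from the question
df = pd.DataFrame({'Name': ['Alex', 'Bob', 'Clarke', 'hal'],
                   'Age': [15, 5, 13, 14]})
lo, hi = 10, 14

try:
    # Equivalent to: (hi >= df['Age']) and (df['Age'] >= lo)
    mid_range = df[hi >= df['Age'] >= lo]
    chained_ok = True
except ValueError as err:
    chained_ok = False
    print('chained comparison failed:', err)

# Combining two boolean masks element-wise with & works instead
mid_range = df[(df['Age'] >= lo) & (df['Age'] <= hi)]
print(mid_range)
```

The parentheses around each comparison matter, because & binds more tightly than >= and <=.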
Answer 0: (score: 1)
Another approach is to use a lambda with apply:
mid_range = df[df['Age'].apply(lambda x: x in range(lo,hi+1))]
When I timed apply with a lambda against the & operator on this 10-row frame, the apply version was faster in a single run:
import time

start_time = time.time()
mid_range = df[df['Age'].apply(lambda x: x in range(lo,hi+1))]
end_time = time.time()
print('execution time(sec):', end_time - start_time)
mid_range:      Name  Age  Pets
2  Clarke   13     2
5     fin   12     1
6   ginny   11     2
7     hal   14     1
8     ian   13     3
execution time(sec): 0.0006139278411865234
start_time = time.time()
mid_range = df[( df['Age']>=lo) & (df['Age']<=hi)]
end_time = time.time()
print('execution time(sec):', end_time - start_time)
mid_range:      Name  Age  Pets
2  Clarke   13     2
5     fin   12     1
6   ginny   11     2
7     hal   14     1
8     ian   13     3
execution time(sec): 0.0015518665313720703
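So with only a handful of rows in the DataFrame the difference probably does not matter, but with a few million rows it could. Keep in mind that these are single runs on a 10-row frame, where fixed overhead dominates, so the result may not carry over to large tables.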
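To check how the two approaches compare on a larger table, a sketch like the following could be used. The 100,000-row frame of random ages, the seed, and the helper names with_apply and with_mask are assumptions made for illustration; they do not come from the original answer. Vectorized masks are usually expected to win at this size, but the numbers depend on the machine and pandas version, so it is worth measuring rather than assuming.

```python
import timeit

import numpy as np
import pandas as pd

lo, hi = 10, 14

# Hypothetical larger frame: 100,000 random ages (size chosen for illustration)
rng = np.random.default_rng(0)
big = pd.DataFrame({'Age': rng.integers(0, 30, size=100_000)})

def with_apply():
    # Calls the Python lambda once per row
    return big[big['Age'].apply(lambda x: x in range(lo, hi + 1))]

def with_mask():
    # Two vectorized comparisons combined element-wise
    return big[(big['Age'] >= lo) & (big['Age'] <= hi)]

# Both approaches should select exactly the same rows
same_rows = with_apply().equals(with_mask())

# Take the best of several repeats to reduce timing noise
t_apply = min(timeit.repeat(with_apply, number=3, repeat=3))
t_mask = min(timeit.repeat(with_mask, number=3, repeat=3))

print('same rows:', same_rows)
print(f'apply: {t_apply:.4f}s  mask: {t_mask:.4f}s')
```

Using timeit.repeat and taking the minimum is less sensitive to one-off delays than a single pair of time.time() calls.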
Answer 1: (score: 1)
You can use .drop() with boolean conditions:
import pandas as pd
data = [['Alex',15,4],['Bob',5,1],['Clarke',13,2],['dan',6,2],['eve',19,1],['fin',12,1],['ginny',11,2],['hal',14,1],['ian',13,3],['jen',9,1] ]
df = pd.DataFrame(data,columns=['Name','Age','Pets'])
print (df)
lo = 10
hi = 14
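# Drop rows with Age >= lo, keeping Age < lo
lo_range = df.drop(df[(df["Age"] >= lo)].index)
print('lo_range:', lo_range)
# Drop rows with Age >= hi or Age < lo, keeping lo <= Age < hi
mid_range = df.drop(df[(df["Age"] >= hi) | (df["Age"] < lo)].index)
print('mid_range:', mid_range)
# Drop rows with Age < hi, keeping Age >= hi
high_range = df.drop(df[(df["Age"] < hi)].index)
print('high_range:', high_range)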
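Printing mid_range gives the following. Note that Age == hi (hal, 14) is excluded, because the drop condition uses >= hi: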
print('mid_range:', mid_range)
mid_range:      Name  Age  Pets
2  Clarke   13     2
5     fin   12     1
6   ginny   11     2
8     ian   13     3
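If the range should include both endpoints, as in the question, one possible variant is sketched below. It changes the drop condition to > hi and compares the result with Series.between, which is inclusive on both ends by default. The names mid_drop and mid_between are illustrative and not part of the original answer.

```python
import pandas as pd

data = [['Alex', 15, 4], ['Bob', 5, 1], ['Clarke', 13, 2], ['dan', 6, 2],
        ['eve', 19, 1], ['fin', 12, 1], ['ginny', 11, 2], ['hal', 14, 1],
        ['ian', 13, 3], ['jen', 9, 1]]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Pets'])
lo, hi = 10, 14

# Drop rows strictly above hi or strictly below lo, so Age == hi is kept
mid_drop = df.drop(df[(df['Age'] > hi) | (df['Age'] < lo)].index)

# Series.between includes both endpoints by default
mid_between = df[df['Age'].between(lo, hi)]

print(mid_between)
```

Selecting with a boolean mask also avoids a pitfall of .drop(), which removes rows by index label and would therefore drop every row sharing a label if the index contained duplicates.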
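Edit: the solution given above (mid_range = df[( df['Age']>=lo) & (df['Age']>=hi)]) returns
mid_range:    Name  Age  Pets
0  Alex   15     4
4   eve   19     1
7   hal   14     1
I guess this is the high range, so it does not work as intended. The second comparison needs to be <= hi to select the middle range.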