Say I have a DataFrame in which most of the columns are categorical data.
> data.head()
age risk sex smoking
0 28 no male no
1 58 no female no
2 27 no male yes
3 26 no male no
4 29 yes female yes
I want to subset this data using a dictionary whose key-value pairs refer to these categorical variables.
tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}
That is, I want to obtain the following subset.
data[ (data.risk == 'no') & (data.smoking == 'yes') & (data.sex == 'female')]
What I would like to write is:
data[tmp]
What is the most Pythonic / pandas way of doing this?
Minimal example:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# three random 0/1 categorical columns; the integer codes are then mapped to labels
x = Series(np.random.randint(0,2,50), dtype='category')
x = x.cat.rename_categories(['no', 'yes'])
y = Series(np.random.randint(0,2,50), dtype='category')
y = y.cat.rename_categories(['no', 'yes'])
z = Series(np.random.randint(0,2,50), dtype='category')
z = z.cat.rename_categories(['male', 'female'])
# ages between 20 and 59, also stored as a categorical column
a = Series(np.random.randint(20,60,50), dtype='category')
data = DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})
tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}
Answer 0 (score: 3)
You can create a lookup DataFrame from the dictionary and then do an inner join with data, which has the same effect as query:
from pandas import merge, DataFrame
merge(DataFrame(tmp, index =[0]), data)
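One thing to keep in mind with the merge approach: joining on columns returns a fresh 0..n-1 index, so the original row labels are lost. A minimal sketch of keeping them, assuming a small hand-made frame (the sample values and the names `lookup` and `subset` are illustrative, not from the question):

```python
import pandas as pd

# Illustrative data; the values are made up for this sketch.
data = pd.DataFrame({
    'age': [28, 58, 27, 26, 29],
    'risk': ['no', 'no', 'no', 'no', 'yes'],
    'sex': ['male', 'female', 'male', 'female', 'female'],
    'smoking': ['no', 'yes', 'yes', 'yes', 'yes'],
})
tmp = {'risk': 'no', 'smoking': 'yes', 'sex': 'female'}

# One-row lookup frame built from the dictionary.
lookup = pd.DataFrame(tmp, index=[0])

# Move the index into a column before the inner join, then restore it
# so the result keeps the original row labels.
subset = pd.merge(lookup, data.reset_index()).set_index('index')
print(subset)
```

The merge joins on all columns the two frames share, which here are exactly the dictionary keys.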
Answer 1 (score: 3)
I would use the .query() method for this task:
In [103]: qry = ' and '.join(["{} == '{}'".format(k,v) for k,v in tmp.items()])
In [104]: qry
Out[104]: "sex == 'female' and risk == 'no' and smoking == 'yes'"
In [105]: data.query(qry)
Out[105]:
age risk sex smoking
7 24 no female yes
22 43 no female yes
23 42 no female yes
25 24 no female yes
32 29 no female yes
40 34 no female yes
43 35 no female yes
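The format string above wraps every value in quotes, so it only fits string values. A hedged variation, assuming the column names are valid Python identifiers, uses `!r` so that strings and numbers are both rendered as literals (the sample frame and the helper name `build_query` are hypothetical):

```python
import pandas as pd

def build_query(conditions):
    """Join 'column == value' tests with 'and'; repr() quotes strings and leaves numbers bare."""
    return ' and '.join('{} == {!r}'.format(k, v) for k, v in conditions.items())

# Illustrative data; the values are made up for this sketch.
data = pd.DataFrame({
    'age': [28, 58, 27, 58],
    'risk': ['no', 'no', 'no', 'yes'],
    'sex': ['male', 'female', 'male', 'female'],
})

qry = build_query({'sex': 'female', 'age': 58})
print(qry)
print(data.query(qry))
```

Because the query is assembled from strings, this is only suitable for dictionaries you control.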
Answer 2 (score: 3)
import numpy as np
import pandas as pd
np.random.seed(123)
x = pd.Series(np.random.randint(0,2,10), dtype='category')
x = x.cat.rename_categories(['no', 'yes'])
y = pd.Series(np.random.randint(0,2,10), dtype='category')
y = y.cat.rename_categories(['no', 'yes'])
z = pd.Series(np.random.randint(0,2,10), dtype='category')
z = z.cat.rename_categories(['male', 'female'])
a = pd.Series(np.random.randint(20,60,10), dtype='category')
data = pd.DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})
print (data)
age risk sex smoking
0 24 no male yes
1 23 yes male yes
2 22 no female no
3 40 no female yes
4 59 no female no
5 22 no male yes
6 40 no female no
7 27 yes male yes
8 55 yes male yes
9 48 no male no
tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}
mask = pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1).all(axis=1)
print (mask)
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 False
9 False
dtype: bool
df1 = data[mask]
print (df1)
age risk sex smoking
3 40 no female yes
L = [(x[0], x[1]) for x in tmp.items()]
print (L)
[('smoking', 'yes'), ('sex', 'female'), ('risk', 'no')]
L = pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1)
print (L)
smoking sex risk
0 True False True
1 True False False
2 False True True
3 True True True
4 False True True
5 True False True
6 False True True
7 True False False
8 True False False
9 False False True
Timings (len(data) = 1M):
N = 1000000
np.random.seed(123)
x = pd.Series(np.random.randint(0,2,N), dtype='category')
x = x.cat.rename_categories(['no', 'yes'])
y = pd.Series(np.random.randint(0,2,N), dtype='category')
y = y.cat.rename_categories(['no', 'yes'])
z = pd.Series(np.random.randint(0,2,N), dtype='category')
z = z.cat.rename_categories(['male', 'female'])
a = pd.Series(np.random.randint(20,60,N), dtype='category')
data = pd.DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})
#[1000000 rows x 4 columns]
print (data)
tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}
In [133]: %timeit (data[pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1).all(axis=1)])
10 loops, best of 3: 89.1 ms per loop
In [134]: %timeit (data.query(' and '.join(["{} == '{}'".format(k,v) for k,v in tmp.items()])))
1 loop, best of 3: 237 ms per loop
In [135]: %timeit (pd.merge(pd.DataFrame(tmp, index =[0]), data.reset_index()).set_index('index'))
1 loop, best of 3: 256 ms per loop
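The mask can also be built without concat by and-ing the per-column comparisons together. A minimal sketch, assuming a hypothetical helper named `subset_by_dict` and made-up sample data; it was not part of the timings above, so no speed claim is made for it:

```python
from functools import reduce
import operator

import pandas as pd

def subset_by_dict(df, conditions):
    """Return the rows of df where every column in conditions equals its value."""
    # One boolean Series per key/value pair.
    masks = [df[col] == val for col, val in conditions.items()]
    # Combine with element-wise AND; the all-True start value means
    # an empty dictionary selects every row.
    mask = reduce(operator.and_, masks, pd.Series(True, index=df.index))
    return df[mask]

# Illustrative data with categorical columns, as in the question.
data = pd.DataFrame({
    'age': [24, 23, 22, 40],
    'risk': pd.Categorical(['no', 'yes', 'no', 'no']),
    'sex': pd.Categorical(['male', 'male', 'female', 'female']),
    'smoking': pd.Categorical(['yes', 'yes', 'no', 'yes']),
})
tmp = {'risk': 'no', 'smoking': 'yes', 'sex': 'female'}
print(subset_by_dict(data, tmp))
```

Wrapping the logic in a function gets close to the `data[tmp]` spelling asked for in the question while keeping the original index.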
Answer 3 (score: 2)
You can build a boolean vector that checks these attributes. There is probably a better way, though:
df[[row.risk == 'no' and row.smoking == 'yes' and row.sex == 'female' for row in df.itertuples(index=False)]]
Answer 4 (score: 0)
I think you can use the to_dict method on the DataFrame and then filter with a list comprehension:
df = pd.DataFrame(data={'age':[28, 29], 'sex':["M", "F"], 'smoking':['y', 'n']})
print(df)
tmp = {'age': 28, 'smoking': 'y', 'sex': 'M'}
# keep only the records that are equal to tmp as whole dictionaries
print(pd.DataFrame([i for i in df.to_dict('records') if i == tmp]))
>>> age sex smoking
0 28 M y
1 29 F n
age sex smoking
0 28 M y
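Note that `i == tmp` only matches when the dictionary lists every column. If the dictionary covers just some of the columns, as in the question, each record can be tested key by key instead. A sketch under that assumption, with illustrative data (the names `records`, `matches` and `result` are not from the answer):

```python
import pandas as pd

# Illustrative data; the third row is added for this sketch.
df = pd.DataFrame(data={'age': [28, 29, 31],
                        'sex': ['M', 'F', 'M'],
                        'smoking': ['y', 'n', 'y']})
tmp = {'smoking': 'y', 'sex': 'M'}  # deliberately has no 'age' key

# Each record is a plain dict mapping column name to value.
records = df.to_dict('records')

# Keep a record if it agrees with tmp on every key that tmp contains.
matches = [rec for rec in records if all(rec[k] == v for k, v in tmp.items())]
result = pd.DataFrame(matches)
print(result)
```

As with the original snippet, the result is rebuilt from dictionaries, so it gets a new 0..n-1 index.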
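You can also convert tmp to a Series:
# equals() also compares the index, so put ts in the same order as the columns
ts = pd.Series(tmp).reindex(df.columns)
print(pd.DataFrame([i[1] for i in df.iterrows() if i[1].equals(ts)]))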