How to subset a DataFrame using a dict?

Asked: 2016-10-18 15:01:31

Tags: python pandas dataframe categorical-data

Say I have been given a DataFrame in which most of the columns hold categorical data.

> data.head()
  age risk     sex smoking
0  28   no    male      no
1  58   no  female      no
2  27   no    male     yes
3  26   no    male      no
4  29  yes  female     yes

I want to subset this data using a dictionary of key-value pairs over these categorical variables.

tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}

So I would like to obtain the following subset.

data[ (data.risk == 'no') & (data.smoking == 'yes') & (data.sex == 'female')]

What I would like to write is:

data[tmp]

What is the most Pythonic / pandas-idiomatic way to do this?

Minimal example:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# randint lives in np.random; rename_categories replaces the old cat.categories assignment
x = Series(np.random.randint(0,2,50), dtype='category')
x = x.cat.rename_categories(['no', 'yes'])

y = Series(np.random.randint(0,2,50), dtype='category')
y = y.cat.rename_categories(['no', 'yes'])

z = Series(np.random.randint(0,2,50), dtype='category')
z = z.cat.rename_categories(['male', 'female'])

a = Series(np.random.randint(20,60,50), dtype='category')

data = DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})

tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}

5 answers:

Answer 0 (score: 3)

You can build a lookup DataFrame from the dictionary and then inner-join it with data, which gives the same result as query:

from pandas import merge, DataFrame
merge(DataFrame(tmp, index =[0]), data)
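One thing worth noting about the merge approach: the result gets a fresh integer index, so the original row labels are lost. The timing comparison further down works around this with reset_index() and set_index('index'). A minimal sketch of the difference, using a small made-up frame (the values and index labels here are assumptions for illustration, not the asker's data):

```python
import pandas as pd

# Toy data with a non-default index (values made up for illustration).
data = pd.DataFrame(
    {
        "age": [28, 58, 27, 26, 29],
        "risk": ["no", "no", "no", "no", "yes"],
        "sex": ["male", "female", "male", "male", "female"],
        "smoking": ["no", "yes", "yes", "no", "yes"],
    },
    index=[10, 11, 12, 13, 14],
)
tmp = {"risk": "no", "smoking": "yes", "sex": "female"}

# One-row lookup frame built from the dict.
lookup = pd.DataFrame(tmp, index=[0])

# Plain merge: inner join on the shared columns; the result is re-indexed from 0.
plain = pd.merge(lookup, data)

# Carry the original labels through the merge as a column, then restore them.
kept = pd.merge(lookup, data.reset_index()).set_index("index")

print(plain)
print(kept)
```

If the original row labels matter downstream, the second form is probably the one you want.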


Answer 1 (score: 3)

I would use the .query() method for this task:

In [103]: qry = ' and '.join(["{} == '{}'".format(k,v) for k,v in tmp.items()])

In [104]: qry
Out[104]: "sex == 'female' and risk == 'no' and smoking == 'yes'"

In [105]: data.query(qry)
Out[105]:
   age risk     sex smoking
7   24   no  female     yes
22  43   no  female     yes
23  42   no  female     yes
25  24   no  female     yes
32  29   no  female     yes
40  34   no  female     yes
43  35   no  female     yes
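The format string above wraps every value in quotes, which fits the all-string dictionary in the question. If the dictionary might also contain numbers, one possible variation is to format values with {!r} so that strings are quoted and integers are not. This is a sketch on made-up data; the mixed-type tmp is an assumption, not part of the question:

```python
import pandas as pd

# Toy data (values made up for illustration).
data = pd.DataFrame(
    {
        "age": [28, 58, 27, 26, 29],
        "risk": ["no", "no", "no", "no", "yes"],
        "sex": ["male", "female", "male", "male", "female"],
        "smoking": ["no", "yes", "yes", "no", "yes"],
    }
)

# Hypothetical filter mixing a string value and an integer value.
tmp = {"risk": "no", "age": 58}

# {!r} inserts repr(v): 'no' keeps its quotes, 58 stays a bare number.
qry = " and ".join("{} == {!r}".format(k, v) for k, v in tmp.items())
result = data.query(qry)

print(qry)
print(result)
```

This still assumes the column names are valid identifiers and that the values have a repr that query can parse.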

Answer 2 (score: 3)

You can use a list comprehension with concat and all:

import numpy as np
import pandas as pd

np.random.seed(123)
x = pd.Series(np.random.randint(0,2,10), dtype='category')
x = x.cat.rename_categories(['no', 'yes'])
y = pd.Series(np.random.randint(0,2,10), dtype='category')
y = y.cat.rename_categories(['no', 'yes'])
z = pd.Series(np.random.randint(0,2,10), dtype='category')
z = z.cat.rename_categories(['male', 'female'])

a = pd.Series(np.random.randint(20,60,10), dtype='category')

data = pd.DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})
print (data)
  age risk     sex smoking
0  24   no    male     yes
1  23  yes    male     yes
2  22   no  female      no
3  40   no  female     yes
4  59   no  female      no
5  22   no    male     yes
6  40   no  female      no
7  27  yes    male     yes
8  55  yes    male     yes
9  48   no    male      no
tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}
mask = pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1).all(axis=1)
print (mask)
0    False
1    False
2    False
3     True
4    False
5    False
6    False
7    False
8    False
9    False
dtype: bool

df1 = data[mask]
print (df1)
  age risk     sex smoking
3  40   no  female     yes
L = [(x[0], x[1]) for x in tmp.items()]
print (L)
[('smoking', 'yes'), ('sex', 'female'), ('risk', 'no')]

L = pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1)
print (L)
  smoking    sex   risk
0    True  False   True
1    True  False  False
2   False   True   True
3    True   True   True
4   False   True   True
5    True  False   True
6   False   True   True
7    True  False  False
8    True  False  False
9   False  False   True

Timings

len(data)=1M

N = 1000000
np.random.seed(123)
x = pd.Series(np.random.randint(0,2,N), dtype='category')
x = x.cat.rename_categories(['no', 'yes'])
y = pd.Series(np.random.randint(0,2,N), dtype='category')
y = y.cat.rename_categories(['no', 'yes'])
z = pd.Series(np.random.randint(0,2,N), dtype='category')
z = z.cat.rename_categories(['male', 'female'])

a = pd.Series(np.random.randint(20,60,N), dtype='category')

data = pd.DataFrame({'risk':x, 'smoking':y, 'sex':z, 'age':a})

#[1000000 rows x 4 columns]
print (data)


tmp = {'risk':'no', 'smoking':'yes', 'sex':'female'}


In [133]: %timeit (data[pd.concat([data[x[0]].eq(x[1]) for x in tmp.items()], axis=1).all(axis=1)])
10 loops, best of 3: 89.1 ms per loop

In [134]: %timeit (data.query(' and '.join(["{} == '{}'".format(k,v) for k,v in tmp.items()])))
1 loop, best of 3: 237 ms per loop

In [135]: %timeit (pd.merge(pd.DataFrame(tmp, index =[0]), data.reset_index()).set_index('index'))
1 loop, best of 3: 256 ms per loop

Answer 3 (score: 2)

You can build a boolean vector that checks these attributes. There is probably a better way:

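# itertuples(index=False) yields (age, risk, sex, smoking) per row, assuming that column order
df[[risk == 'no' and smoking == 'yes' and sex == 'female' for (age, risk, sex, smoking) in df.itertuples(index=False)]]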
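The expression above hard-codes both the conditions and the column order. One way it might be tied back to the dictionary from the question is to look up each key on the row tuple with getattr. This is a sketch on made-up data, and it assumes the column names are valid Python identifiers so that itertuples exposes them as attributes:

```python
import pandas as pd

# Toy data (values made up for illustration).
df = pd.DataFrame(
    {
        "age": [28, 58, 27, 26, 29],
        "risk": ["no", "no", "no", "no", "yes"],
        "sex": ["male", "female", "male", "male", "female"],
        "smoking": ["no", "yes", "yes", "no", "yes"],
    }
)
tmp = {"risk": "no", "smoking": "yes", "sex": "female"}

# One boolean per row: True only if every key in tmp matches that row.
mask = [
    all(getattr(row, k) == v for k, v in tmp.items())
    for row in df.itertuples(index=False)
]
subset = df[mask]

print(subset)
```

Like any row-by-row loop, this is likely to be much slower than the vectorized answers on large frames.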

Answer 4 (score: 0)

I think you can use the DataFrame's to_dict method and then filter with a list comprehension:

df = pd.DataFrame(data={'age':[28, 29], 'sex':["M", "F"], 'smoking':['y', 'n']})
print(df)
tmp = {'age': 28, 'smoking': 'y', 'sex': 'M'}

# keep only the records that are equal to tmp as whole dicts
print(pd.DataFrame([i for i in df.to_dict('records') if i == tmp]))


   age sex smoking
0   28   M       y
1   29   F       n

   age sex smoking
0   28   M       y

You can also convert tmp to a Series:

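ts = pd.Series(tmp)

# reindex each row to ts's label order, because Series.equals also compares the index
print(pd.DataFrame([row for _, row in df.iterrows() if row.reindex(ts.index).equals(ts)]))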
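Both variants in this answer compare whole rows, so they only match when tmp names every column. The dictionary in the question leaves out age, so a partial match is probably closer to what was asked. A minimal sketch of that idea on made-up data (the frame and values are assumptions for illustration):

```python
import pandas as pd

# Toy data (values made up for illustration).
df = pd.DataFrame(
    {
        "age": [28, 58, 27, 26, 29],
        "risk": ["no", "no", "no", "no", "yes"],
        "sex": ["male", "female", "male", "male", "female"],
        "smoking": ["no", "yes", "yes", "no", "yes"],
    }
)
tmp = {"risk": "no", "smoking": "yes", "sex": "female"}  # no 'age' key

# Check only the keys present in tmp; other columns are ignored.
matches = [
    rec for rec in df.to_dict("records")
    if all(rec[k] == v for k, v in tmp.items())
]
result = pd.DataFrame(matches)

print(result)
```

Note that rebuilding the frame from records discards the original index and dtypes, so this is more of an illustration than a drop-in replacement for boolean masking.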