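I have MultiIndex columns: the second level repeats Job openings and Hires under each top-level column. I want to subtract one from the other for each top-level column, but everything I have tried ends in an indexing or slicing error. How can I compute this?
Sample data: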
>>> df.head()
Out[25]:
Total nonfarm Total private
Hires Job openings Hires Job openings
date
2001-01-01 5777 5385 5419 4887
2002-01-01 4849 3759 4539 3381
2003-01-01 4971 3824 4645 3424
2004-01-01 4827 3459 4552 3153
2005-01-01 5207 3670 4876 3358
Expected output:
Out[25]:
Total nonfarm Total private
difference difference
date
2001-01-01 1234 5678
2002-01-01 1234 5678
2003-01-01 1234 5678
2004-01-01 1234 5678
2005-01-01 1234 5678
where the numbers are obviously not correct.
To get a generally applicable approach, I tried setting up
def apply(group):
    result = group.loc[:, pd.IndexSlice[:, 'Job openings']].div(group.loc[:, pd.IndexSlice[:, 'Hires']].values)
    result.columns = pd.MultiIndex.from_product([[group.columns.get_level_values(0)[0]], ['Ratio']])
    return result.values
foo = df.groupby(axis=1, level=0).apply(apply)
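This has two problems: the .values cheat is needed for the division to come out right, and foo is not a proper dataframe:
Accommodation and food services [[0.76],[0.480349344978],[0.501388888889],[...
Arts, entertainment, and recreation [[0.558139534884],[0.46017699115],[0.2483221 ...
Construction [[0.35],[0.274881516588],[0.267260579065],[...
I first tried returning result instead of result.values, but that only produced a dataframe full of NaN.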
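The NaN result is consistent with how pandas handles DataFrame arithmetic: operands are aligned on their column labels first, and the 'Job openings' and 'Hires' slices share no labels. A minimal sketch, assuming a two-row, single-group excerpt of the sample data (the variable names are only for illustration):

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples(
    [("Total nonfarm", "Hires"), ("Total nonfarm", "Job openings")]
)
df = pd.DataFrame([[5777, 5385], [4849, 3759]], columns=cols)

idx = pd.IndexSlice
num = df.loc[:, idx[:, "Job openings"]]  # column ('Total nonfarm', 'Job openings')
den = df.loc[:, idx[:, "Hires"]]         # column ('Total nonfarm', 'Hires')

# Label-aligned division: no column label matches, so every cell is NaN
aligned = num.div(den)

# Dividing by the raw array skips alignment and works by position
by_position = num.div(den.values)

print(aligned)
print(by_position)
```

This is why .values makes the division work: it strips the labels, so pandas has nothing to align on.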
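As for the highest-voted answer, I don't like that it relies on the .diff() or .div() hack, which makes the code hard to read and hard to extend when there are more than two columns at the sublevel.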
Answer 0 (score: 3)
import pandas as pd
df = pd.DataFrame(
[
[5777, 5385, 5419, 4887],
[4849, 3759, 4539, 3381],
[4971, 3824, 4645, 3424],
[4827, 3459, 4552, 3153],
[5207, 3670, 4876, 3358],
],
index=pd.to_datetime(['2001-01-01',
'2002-01-01',
'2003-01-01',
'2004-01-01',
'2005-01-01']),
columns=pd.MultiIndex.from_tuples(
[('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
('Total private', 'Hires'), ('Total private', 'Job Openings')]
)
)
print(df)
Total nonfarm Total private
Hires Job Openings Hires Job Openings
2001-01-01 5777 5385 5419 4887
2002-01-01 4849 3759 4539 3381
2003-01-01 4971 3824 4645 3424
2004-01-01 4827 3459 4552 3153
2005-01-01 5207 3670 4876 3358
Try:
df.T.groupby(level=0).diff(-1).dropna().T
Total nonfarm Total private
Hires Hires
2001-01-01 392.0 532.0
2002-01-01 1090.0 1158.0
2003-01-01 1147.0 1221.0
2004-01-01 1368.0 1399.0
2005-01-01 1537.0 1518.0
To apply other transformations, such as a ratio, you can do:
import numpy as np
print(df.T.groupby(level=0).apply(lambda x: np.exp(np.log(x).diff(-1))).dropna().T)
Total nonfarm Total private
Hires Hires
2001-01-01 1.072795 1.108860
2002-01-01 1.289971 1.342502
2003-01-01 1.299948 1.356600
2004-01-01 1.395490 1.443704
2005-01-01 1.418801 1.452055
Or:
print(df.T.groupby(level=0).apply(lambda x: x.div(x.shift(-1))).dropna().T)
Total nonfarm Total private
Hires Hires
2001-01-01 1.072795 1.108860
2002-01-01 1.289971 1.342502
2003-01-01 1.299948 1.356600
2004-01-01 1.395490 1.443704
2005-01-01 1.418801 1.452055
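Both variants give the same numbers because the difference of two logarithms is the logarithm of the quotient, so exponentiating it recovers the plain ratio. A quick check on the first 'Total nonfarm' row of the sample data (scalar values copied from the table; the variable names are only for illustration):

```python
import numpy as np

hires, openings = 5777.0, 5385.0  # first row of 'Total nonfarm'

via_logs = np.exp(np.log(hires) - np.log(openings))  # exp(log a - log b)
direct = hires / openings                            # a / b

print(round(via_logs, 6), round(direct, 6))
```

The x.div(x.shift(-1)) form is probably the easier one to read, since it states the division directly.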
To rename the columns and combine the result with the original dataframe, you can:
df2 = df.T.groupby(level=0).diff(-1).dropna().T
df2.columns = pd.MultiIndex.from_tuples(
[('Total nonfarm', 'difference'),
('Total private', 'difference')])
pd.concat([df, df2], axis=1).sort_index(axis=1)
which looks like:
Total nonfarm Total private \
Hires Job Openings difference Hires Job Openings
2001-01-01 5777 5385 392.0 5419 4887
2002-01-01 4849 3759 1090.0 4539 3381
2003-01-01 4971 3824 1147.0 4645 3424
2004-01-01 4827 3459 1368.0 4552 3153
2005-01-01 5207 3670 1537.0 4876 3358
difference
2001-01-01 532.0
2002-01-01 1158.0
2003-01-01 1221.0
2004-01-01 1399.0
2005-01-01 1518.0
Answer 1 (score: 2)
I think you can use IndexSlice:
idx = pd.IndexSlice
df[('Total private','difference')] = (df.loc[:, idx[('Total nonfarm', 'Hires')]] -
df.loc[:, idx[('Total private', 'Hires')]])
print (df)
Total nonfarm Total private
date Hires Job openings Hires Job openings difference
2001-01-01 5777 5385 5419 4887 358
2002-01-01 4849 3759 4539 3381 310
2003-01-01 4971 3824 4645 3424 326
2004-01-01 4827 3459 4552 3153 275
2005-01-01 5207 3670 4876 3358 331
If you want multiple columns, you can use a modified version of piRSquared's answer; the transpose can be dropped:
print (df.groupby(level=0,axis=1).diff(-1).dropna(axis=1))
Total nonfarm Total private
date Hires Hires Job openings
2001-01-01 392.0 532.0 4495.0
2002-01-01 1090.0 1158.0 2291.0
2003-01-01 1147.0 1221.0 2277.0
2004-01-01 1368.0 1399.0 1785.0
2005-01-01 1537.0 1518.0 1821.0
Answer 2 (score: 1)
Let's keep it simple.
In [19]: df['Total nonfarm'] - df['Total private']
Out[19]:
Hires Job Openings
2001-01-01 358 498
2002-01-01 310 378
2003-01-01 326 400
2004-01-01 275 306
2005-01-01 331 312
Answer 3 (score: 1)
Another approach is to swap the column levels and use column accessors.
import pandas as pd
df = pd.DataFrame(
[
[5777, 5385, 5419, 4887],
[4849, 3759, 4539, 3381],
[4971, 3824, 4645, 3424],
[4827, 3459, 4552, 3153],
[5207, 3670, 4876, 3358],
],
index=pd.to_datetime(['2001-01-01',
'2002-01-01',
'2003-01-01',
'2004-01-01',
'2005-01-01']),
columns=pd.MultiIndex.from_tuples(
[('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
('Total private', 'Hires'), ('Total private', 'Job Openings')]
)
)
print(df)
Total nonfarm Total private
Hires Job Openings Hires Job Openings
2001-01-01 5777 5385 5419 4887
2002-01-01 4849 3759 4539 3381
2003-01-01 4971 3824 4645 3424
2004-01-01 4827 3459 4552 3153
2005-01-01 5207 3670 4876 3358
If we swap the levels and then sort, it looks like:
print(df.swaplevel(0, 1, 1).sort_index(axis=1))
Hires Job Openings
Total nonfarm Total private Total nonfarm Total private
2001-01-01 5777 5419 5385 4887
2002-01-01 4849 4539 3759 3381
2003-01-01 4971 4645 3824 3424
2004-01-01 4827 4552 3459 3153
2005-01-01 5207 4876 3670 3358
With this, we can access the hires via .Hires or ['Hires']. Combining this with your subtraction requirement:
print(df.swaplevel(0, 1, 1)['Hires'])
Total nonfarm Total private
2001-01-01 5777 5419
2002-01-01 4849 4539
2003-01-01 4971 4645
2004-01-01 4827 4552
2005-01-01 5207 4876
print(df.swaplevel(0, 1, 1)['Hires'] - df.swaplevel(0, 1, 1)['Job Openings'])
Total nonfarm Total private
2001-01-01 392 532
2002-01-01 1090 1158
2003-01-01 1147 1221
2004-01-01 1368 1399
2005-01-01 1537 1518
Putting it all together, I did:
df_ = df.swaplevel(0, 1, 1)
_df = pd.concat([
    df_,
    pd.concat([df_['Hires'] - df_['Job Openings'], df_['Hires'] / df_['Job Openings']],
              axis=1, keys=['Difference', 'Ratio'])
], axis=1)
df = _df.swaplevel(0, 1, 1).sort_index(axis=1)
print(df)
Total nonfarm Total private \
Difference Hires Job Openings Ratio Difference Hires
2001-01-01 392 5777 5385 1.072795 532 5419
2002-01-01 1090 4849 3759 1.289971 1158 4539
2003-01-01 1147 4971 3824 1.299948 1221 4645
2004-01-01 1368 4827 3459 1.395490 1399 4552
2005-01-01 1537 5207 3670 1.418801 1518 4876
Job Openings Ratio
2001-01-01 4887 1.108860
2002-01-01 3381 1.342502
2003-01-01 3424 1.356600
2004-01-01 3153 1.443704
2005-01-01 3358 1.452055
You can also use xs to grab a cross-section.
kw = dict(axis=1, level=1)
df.xs('Hires', **kw) - df.xs('Job Openings', **kw)
Total nonfarm Total private
2001-01-01 392 532
2002-01-01 1090 1158
2003-01-01 1147 1221
2004-01-01 1368 1399
2005-01-01 1537 1518
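The xs result has only the top-level labels as columns. To get the two-level layout from the question, a second level can be attached before concatenating. A minimal sketch, assuming a two-row excerpt of the sample data and the label 'difference' for the new column:

```python
import pandas as pd

df = pd.DataFrame(
    [[5777, 5385, 5419, 4887],
     [4849, 3759, 4539, 3381]],
    index=pd.to_datetime(["2001-01-01", "2002-01-01"]),
    columns=pd.MultiIndex.from_product(
        [["Total nonfarm", "Total private"], ["Hires", "Job Openings"]]
    ),
)

# Cross-sections on level 1 drop that level, leaving the top-level labels
diff = df.xs("Hires", axis=1, level=1) - df.xs("Job Openings", axis=1, level=1)

# Re-attach a second level so the result lines up with the original columns
diff.columns = pd.MultiIndex.from_product([diff.columns, ["difference"]])

out = pd.concat([df, diff], axis=1).sort_index(axis=1)
print(out)
```

Because xs selects by label, this keeps working when the sublevel holds more than two columns.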
Answer 4 (score: 1)
Using groupby and apply:
import pandas as pd
df = pd.DataFrame(
[
[5777, 5385, 5419, 4887],
[4849, 3759, 4539, 3381],
[4971, 3824, 4645, 3424],
[4827, 3459, 4552, 3153],
[5207, 3670, 4876, 3358],
],
index=pd.to_datetime(['2001-01-01',
'2002-01-01',
'2003-01-01',
'2004-01-01',
'2005-01-01']),
columns=pd.MultiIndex.from_tuples(
[('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
('Total private', 'Hires'), ('Total private', 'Job Openings')]
)
)
print(df)
def diff(group):
    g = group.shift().sub(group).dropna()
    g.index = ['Difference']
    return g
def ratio(group):
    g = group.shift().div(group).dropna()
    g.index = ['Ratio']
    return g
def do_nothing(group):
    return group
pd.concat(
[df.T.groupby(level=0).apply(f).T for f in [diff, ratio, do_nothing]],
axis=1
).sort_index(axis=1)
Total nonfarm Total private \
Difference Hires Job Openings Ratio Difference Hires
2001-01-01 392.0 5777 5385 1.07 532.0 5419
2002-01-01 1090.0 4849 3759 1.29 1158.0 4539
2003-01-01 1147.0 4971 3824 1.30 1221.0 4645
2004-01-01 1368.0 4827 3459 1.40 1399.0 4552
2005-01-01 1537.0 5207 3670 1.42 1518.0 4876
Job Openings Ratio
2001-01-01 4887 1.11
2002-01-01 3381 1.34
2003-01-01 3424 1.36
2004-01-01 3153 1.44
2005-01-01 3358 1.45
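If the shift-based helpers feel too tied to having exactly two sub-columns, one alternative is to loop over the top-level labels and name the sub-columns explicitly. This is only a sketch: add_derived and its funcs argument are hypothetical names introduced here, not part of pandas, and the data is a two-row excerpt of the sample.

```python
import pandas as pd

def add_derived(df, funcs):
    """Append derived columns under each top-level label.

    funcs maps a new second-level name to a function that receives the
    sub-frame of one top-level group and returns a Series.
    """
    pieces = {}
    for top in df.columns.get_level_values(0).unique():
        sub = df[top]  # columns are now just the second level
        for name, func in funcs.items():
            pieces[(top, name)] = func(sub)
    derived = pd.DataFrame(pieces)  # tuple keys become MultiIndex columns
    return pd.concat([df, derived], axis=1).sort_index(axis=1)

df = pd.DataFrame(
    [[5777, 5385, 5419, 4887],
     [4849, 3759, 4539, 3381]],
    index=pd.to_datetime(["2001-01-01", "2002-01-01"]),
    columns=pd.MultiIndex.from_product(
        [["Total nonfarm", "Total private"], ["Hires", "Job Openings"]]
    ),
)

out = add_derived(df, {
    "Difference": lambda g: g["Hires"] - g["Job Openings"],
    "Ratio": lambda g: g["Hires"] / g["Job Openings"],
})
print(out)
```

Each function names the columns it uses, so extra sub-columns in a group do not change the result.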