I have a CSV or data frame that looks like this, but with several hundred thousand rows:
df = {'Date': {0: '2014-01-01',
1: '2014-01-01',
2: '2014-01-01',
3: '2014-01-02',
4: '2014-01-02'},
'Name': {0: 'John',
1: 'John',
2: 'Rob',
3: 'Mel',
4: 'Rob'},
'Rank': {0: 1, 1: 3, 2: 2, 3: 5, 4: 6},
'Count': {0: 10, 1: 3, 2: 9, 3: 11, 4: 4}}
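The names repeat across dates, but the counts and ranks change. Rather than having one row per date for each of these names, as I do now, I want to arrange my data frame so that each date has a single row. That is, I would like my table to look like this:
Date John_Rank Rob_Rank Mel_Rank John_Count Mel_Count Rob_Count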
2014-01-01 ... ... ... ... ...
2014-01-02 ... ... ... ... ...
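I want to use this format to calculate differences in rank. I have run into this situation several times before, but I have not had this many rows to deal with in a long time, and until now I have just done the work manually. Any advice would be greatly appreciated!
Answer 0 (score: 2)
I think you can use pivot_table with the default aggfunc='mean'. Because the default is the mean, duplicate Date/Name pairs (John appears twice on 2014-01-01) are averaged: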
import pandas as pd
d = {'Date': {0: '2014-01-01',
1: '2014-01-01',
2: '2014-01-01',
3: '2014-01-02',
4: '2014-01-02'},
'Name': {0: 'John',
1: 'John',
2: 'Rob',
3: 'Mel',
4: 'Rob'},
'Rank': {0: 1, 1: 3, 2: 2, 3: 5, 4: 6},
'Count': {0: 10, 1: 3, 2: 9, 3: 11, 4: 4}}
df = pd.DataFrame(d)
print(df)
   Count        Date  Name  Rank
0     10  2014-01-01  John     1
1      3  2014-01-01  John     3
2      9  2014-01-01   Rob     2
3     11  2014-01-02   Mel     5
4      4  2014-01-02   Rob     6
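# Pivot to one row per date; the columns become a (value, name) MultiIndex
df = pd.pivot_table(df, index='Date', columns='Name')
# Flatten the MultiIndex into single names such as 'Count_John'
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)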
            Count_John  Count_Mel  Count_Rob  Rank_John  Rank_Mel  Rank_Rob
Date
2014-01-01         6.5        NaN          9          2       NaN         2
2014-01-02         NaN         11          4        NaN         5         6
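If averaging duplicate Date/Name rows is not what you want, pivot_table also accepts a dict for aggfunc, so each value column can be aggregated differently. The following is only a sketch: choosing the lowest rank and the summed count is an assumption made for illustration, and the names raw and wide are made up here.

```python
import pandas as pd

# Same sample data as above, written column-wise (hypothetical name: raw)
raw = pd.DataFrame({
    'Date': ['2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-02'],
    'Name': ['John', 'John', 'Rob', 'Mel', 'Rob'],
    'Rank': [1, 3, 2, 5, 6],
    'Count': [10, 3, 9, 11, 4],
})

# Assumption: keep the lowest rank and the total count for each Date/Name pair
wide = pd.pivot_table(raw, index='Date', columns='Name',
                      values=['Rank', 'Count'],
                      aggfunc={'Rank': 'min', 'Count': 'sum'})

# Flatten the (value, name) MultiIndex into names such as 'Rank_John'
wide.columns = ['_'.join(col) for col in wide.columns]
print(wide)
```

Date/Name pairs that never occur (for example Mel on 2014-01-01) still come out as NaN.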
Or, if you want the name to come first, swap the levels of the column MultiIndex with swaplevel before flattening:
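# Rebuild the pivot from the original data, since df was flattened above
df = pd.pivot_table(pd.DataFrame(d), index='Date', columns='Name')
# Put the name level first, then flatten to names such as 'John_Count'
df.columns = df.columns.swaplevel(0, 1)
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df)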
            John_Count  Mel_Count  Rob_Count  John_Rank  Mel_Rank  Rob_Rank
Date
2014-01-01         6.5        NaN          9          2       NaN         2
2014-01-02         NaN         11          4        NaN         5         6
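Since the goal in the question is to calculate differences in rank, here is a minimal sketch of how the wide layout could be used for that. The question does not say which difference is meant, so both readings below are assumptions, and the names raw, ranks, rank_change and john_vs_rob are made up for illustration.

```python
import pandas as pd

# Same sample data as above, written column-wise (hypothetical name: raw)
raw = pd.DataFrame({
    'Date': ['2014-01-01', '2014-01-01', '2014-01-01', '2014-01-02', '2014-01-02'],
    'Name': ['John', 'John', 'Rob', 'Mel', 'Rob'],
    'Rank': [1, 3, 2, 5, 6],
    'Count': [10, 3, 9, 11, 4],
})

# One column per name holding the mean rank per date (same aggregation as the answer)
ranks = raw.pivot_table(index='Date', columns='Name', values='Rank', aggfunc='mean')

# Assumed reading 1: change in each name's rank compared with the previous date
rank_change = ranks.diff()

# Assumed reading 2: gap between two names on the same date
john_vs_rob = ranks['John'] - ranks['Rob']

print(rank_change)
print(john_vs_rob)
```

The dates here are ISO-formatted strings, so the sorted index is also in chronological order; with other date formats it would be safer to convert the column with pd.to_datetime first. Any name missing on a date yields NaN in both results.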