Using a pandas DataFrame and pandas groupby to compute salaries

Time: 2016-09-07 22:51:50

Tags: python pandas dataframe ipython

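For my assignment, I need to import baseball salary data into a pandas DataFrame. From there, one of my goals is to get the total salary of every team for each year.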

I managed that, but to move on to the next task I need a pandas DataFrame. sumOfSalaries.dtype is returning int64.

Questions:
    1. How do I convert the data from the code below into a DataFrame?
    2. How do I remove the index in sumOfSalaries?

Code:

 import pandas as pd
 salariesData = pd.read_csv('Salaries.csv')

 #sum salaries by year and team
 sumOfSalaries = salariesData.groupby(by=['yearID','teamID'])['salary'].sum()

 del sumOfSalaries.index.names #line giving me errors

 #create DataFrame from grouped data 
 df = pd.DataFrame(sumOfSalaries, columns = ['yearID', 'teamID', 'salary'])
 df

 _____________________________________________________________________________

 sumofSalaries:
 yearID  teamID
 1985    ATL        14807000
         BAL        11560712
         BOS        10897560
         CAL        14427894
         CHA         9846178

 ...and so on
 _____________________________________________________________________________

  df:

            yearID  teamID  salary
 yearID teamID          
 1985   ATL NaN NaN 14807000
        BAL NaN NaN 11560712
        BOS NaN NaN 10897560
        CAL NaN NaN 14427894

2 Answers:

Answer 0: (score: 1)

del has a very specific meaning in Python, and it is not useful on a DataFrame like this.
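To make that concrete, here is a small sketch of what del does and why it fails on index.names. The toy data is made up for illustration (it is not from Salaries.csv), and since the exact exception type and message may differ between pandas versions, the sketch catches a generic Exception rather than assuming one.

```python
import pandas as pd

# Toy data, invented for illustration
salariesData = pd.DataFrame({
    'yearID': [1985, 1985, 1986],
    'teamID': ['ATL', 'BAL', 'ATL'],
    'salary': [10, 20, 30],
})
sumOfSalaries = salariesData.groupby(by=['yearID', 'teamID'])['salary'].sum()

# del unbinds a name (or removes an attribute/item); it does not "clear" a value
tmp = 1
del tmp  # the name tmp no longer exists after this line

# Deleting index.names is not supported, so this raises instead of
# removing the level names
error = None
try:
    del sumOfSalaries.index.names
except Exception as exc:  # type and message may vary by version
    error = exc

print(type(error).__name__, error)
print(list(sumOfSalaries.index.names))  # the names are still there
```

So the line marked as giving errors in the question cannot work as written; the names have to be reassigned, or the index reset, instead.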

You want to use reset_index to get rid of the MultiIndex after the groupby, assuming getting rid of the MultiIndex is what you actually want.

import pandas as pd
salariesData = pd.read_csv('Salaries.csv')

# sum salaries by year and team, then move the index levels back into columns
sumOfSalaries = (
    salariesData.groupby(by=['yearID', 'teamID'])['salary']
    .sum()
    .reset_index()
)

Read the groupby docs and the multiindexing docs for details.
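If you do not have Salaries.csv at hand, the same idea can be tried on a small in-memory frame. This is only a sketch: the rows are invented for illustration, and the variable name grouped is mine, not from the question.

```python
import pandas as pd

# Toy data, invented for illustration
salariesData = pd.DataFrame({
    'yearID': [1985, 1985, 1985, 1986],
    'teamID': ['ATL', 'ATL', 'BAL', 'ATL'],
    'salary': [10, 20, 30, 40],
})

# groupby(...)['salary'].sum() returns a Series indexed by (yearID, teamID)
grouped = salariesData.groupby(by=['yearID', 'teamID'])['salary'].sum()

# reset_index() moves both index levels into ordinary columns
# and returns a DataFrame with a default integer index
sumOfSalaries = grouped.reset_index()

print(type(grouped).__name__)        # Series
print(type(sumOfSalaries).__name__)  # DataFrame
print(sumOfSalaries)
```

This covers both questions at once: the result is a DataFrame, and yearID and teamID are plain columns rather than index levels.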

Answer 1: (score: 0)

I think you only need to add the parameter as_index=False to groupby. The output is then a DataFrame without a MultiIndex:

sumOfSalaries = salariesData.groupby(by=['yearID','teamID'], as_index=False)['salary'].sum()

Sample:

import pandas as pd

salariesData = pd.DataFrame({
'yearID': {0: 1985, 1: 1985, 2: 1985, 3: 1985, 4: 1985, 5: 1986, 6: 1986, 7: 1986, 8: 1987, 9: 1987}, 
'teamID': {0: 'ATL', 1: 'ATL', 2: 'ATL', 3: 'CAL', 4: 'CAL', 5: 'CAL', 6: 'CAL', 7: 'BOS', 8: 'BOS', 9: 'BOS'}, 
'salary': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 10, 6: 20, 7: 30, 8: 40, 9: 50}
},
columns = ['yearID','teamID','salary']
)

print (salariesData)
   yearID teamID  salary
0    1985    ATL      10
1    1985    ATL      20
2    1985    ATL      30
3    1985    CAL      40
4    1985    CAL      50
5    1986    CAL      10
6    1986    CAL      20
7    1986    BOS      30
8    1987    BOS      40
9    1987    BOS      50

sumOfSalaries = salariesData.groupby(by=['yearID','teamID'], as_index=False)['salary'].sum()

print (sumOfSalaries)
   yearID teamID  salary
0    1985    ATL      60
1    1985    CAL      90
2    1986    BOS      30
3    1986    CAL      30
4    1987    BOS      90

Also, if you need to remove the index names, assign (None, None) to them. This is not necessary if you use the solution above:

sumOfSalaries.index.names = (None, None)

Sample:

sumOfSalaries = salariesData.groupby(by=['yearID','teamID'])['salary'].sum()
sumOfSalaries.index.names = (None, None)

print (sumOfSalaries)

1985  ATL    60
      CAL    90
1986  BOS    30
      CAL    30
1987  BOS    90
Name: salary, dtype: int64
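One side effect worth knowing about, sketched below on made-up data: once the names are cleared, a later reset_index has nothing to name the new columns after and falls back to level_0 and level_1. The variable names withoutNames and withNames are only for illustration.

```python
import pandas as pd

# Toy data, invented for illustration
salariesData = pd.DataFrame({
    'yearID': [1985, 1985, 1986],
    'teamID': ['ATL', 'CAL', 'BOS'],
    'salary': [60, 90, 30],
})
sumOfSalaries = salariesData.groupby(by=['yearID', 'teamID'])['salary'].sum()

# Clear the level names, as in the sample above
sumOfSalaries.index.names = (None, None)
withoutNames = sumOfSalaries.reset_index()
print(withoutNames.columns.tolist())  # ['level_0', 'level_1', 'salary']

# Assigning the names back restores the usual column labels
sumOfSalaries.index.names = ['yearID', 'teamID']
withNames = sumOfSalaries.reset_index()
print(withNames.columns.tolist())     # ['yearID', 'teamID', 'salary']
```

So if the end goal is a flat DataFrame, it is usually simpler to keep the names and reset the index, or to use as_index=False from the start.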