Creating a new dataframe from an aggregated dataframe

Time: 2015-12-10 06:19:16

Tags: python pandas dataframe

I have a dataframe that aggregates people by location

location_id | score | number_of_males | number_of_females
     1      |  20   |        2        |         1
     2      |  45   |        1        |         2

I want to create a new dataframe that un-aggregates this one, so that I get something like
location_id | score | number_of_males | number_of_females
     1      |  20   |        1        |         0
     1      |  20   |        1        |         0
     1      |  20   |        0        |         1
     2      |  45   |        1        |         0
     2      |  45   |        0        |         1
     2      |  45   |        0        |         1

or, even better,

location_id | score |       sex 
     1      |  20   |       male       
     1      |  20   |       male    
     1      |  20   |       female
     2      |  45   |       male
     2      |  45   |       female
     2      |  45   |       female

I was thinking of doing something like

import pandas as pd
aggregated_df = pd.DataFrame.from_csv(SOME_PATH)
unaggregated_df = df = pd.DataFrame(columns=['location_id', 'score', 'sex'])

for row in aggregated_df:
  for column in ['number_of_males', 'number_of_females']:
    for number_of_people in range(0, row[column]):
      if column == 'number_of_males':
        sex = 'male'
      else:
        sex = 'female'
      unaggregated_df.append([{'location_id': row['location_id'],
                              'score': row['score'],
                              'sex': sex}],
                             ignore_index=True)

However, I can't get the dictionaries to append, even though pandas seems to support this.

Is there a more pandthonic (the pandas version of pythonic) way to achieve this?
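For reference, two things appear to stop the loop above from working as written: iterating over a DataFrame directly yields its column labels rather than its rows, and `DataFrame.append` returned a new frame instead of modifying the original (it was removed entirely in pandas 2.0). A minimal sketch of a repaired loop follows. The inline data stands in for the CSV, since `SOME_PATH` is not given, and the `labels` and `rows` names are introduced here for illustration.

```python
import pandas as pd

# Stand-in for the CSV in the question (SOME_PATH is not given there).
aggregated_df = pd.DataFrame({
    'location_id': [1, 2],
    'score': [20, 45],
    'number_of_males': [2, 1],
    'number_of_females': [1, 2],
})

# Map each count column to the label it should produce.
labels = {'number_of_males': 'male', 'number_of_females': 'female'}

rows = []
# iterrows() yields (index, row) pairs; iterating the frame itself
# would yield column names instead.
for _, row in aggregated_df.iterrows():
    for column, sex in labels.items():
        for _ in range(int(row[column])):
            rows.append({'location_id': row['location_id'],
                         'score': row['score'],
                         'sex': sex})

# Build the frame once instead of appending row by row.
unaggregated_df = pd.DataFrame(rows, columns=['location_id', 'score', 'sex'])
print(unaggregated_df)
```

This still loops in Python, so it is likely to be slow on large inputs, but it produces the second layout shown in the question.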

2 Answers:

Answer 0: (score: 2)

Here is a way to get the result using groupby:

ids = ['location_id','score']

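def foo(d):
    # One label per person in the group; list * int repeats the label.
    n_males = int(d['number_of_males'].sum())
    n_females = int(d['number_of_females'].sum())
    return pd.Series(['male'] * n_males + ['female'] * n_females)

# df is the aggregated dataframe from the question.
pd.melt(df.groupby(ids).apply(foo).reset_index(), id_vars=ids).drop('variable', axis=1)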

#Out[13]:
#   location_id  score   value
#0            1     20    male
#1            2     45    male
#2            1     20    male
#3            2     45  female
#4            1     20  female
#5            2     45  female
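
The output above is interleaved by position rather than grouped by location. If rows grouped by location are preferred, one alternative sketch uses `DataFrame.explode`, which was added in pandas 0.25 and so was not available when this answer was written. The `sex_lists` and `result` names are introduced here for illustration, and the data is the example from the question.

```python
import pandas as pd

# Same example data as in the question.
df = pd.DataFrame({
    'location_id': [1, 2],
    'score': [20, 45],
    'number_of_males': [2, 1],
    'number_of_females': [1, 2],
})

# Build one list of labels per row, e.g. ['male', 'male', 'female'].
sex_lists = pd.Series(
    [['male'] * m + ['female'] * f
     for m, f in zip(df['number_of_males'].tolist(),
                     df['number_of_females'].tolist())],
    index=df.index,
)

result = (df[['location_id', 'score']]
          .assign(sex=sex_lists)
          .explode('sex')          # one output row per list element
          .reset_index(drop=True))
print(result)
```

Note that `explode` turns an empty list into a single row with a missing value, so locations with zero people would need to be filtered out beforehand.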

Answer 1: (score: 0)

This is as far as I can get with pandas functions alone

print(df)
location_id  score  number_of_males  number_of_females
     1        20           2                 1
     2        45           1                 2

Convert the two count columns into a single column:

df.set_index(['location_id','score']).stack().reset_index()
Out[102]: 
   location_id  score            level_2  0
0            1     20    number_of_males  2
1            1     20  number_of_females  1
2            2     45    number_of_males  1
3            2     45  number_of_females  2

But from here I would have to use a Python loop to multiply the rows :(
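
One possible way to finish from the stacked frame without a Python loop is to repeat each row label by its count with `Index.repeat` and then select those labels with `.loc`. This is a sketch under assumptions: the column names `sex` and `count` and the variables `stacked`, `repeated`, and `result` are chosen here and do not come from the answer.

```python
import pandas as pd

# Same example data as in the question.
df = pd.DataFrame({
    'location_id': [1, 2],
    'score': [20, 45],
    'number_of_males': [2, 1],
    'number_of_females': [1, 2],
})

# Long format: one row per (location, count column), as in the stack() step.
stacked = df.set_index(['location_id', 'score']).stack().reset_index()
stacked.columns = ['location_id', 'score', 'sex', 'count']  # names chosen here

# Turn the old column names into labels.
stacked['sex'] = stacked['sex'].map({'number_of_males': 'male',
                                     'number_of_females': 'female'})

# Repeat each row label `count` times, then select those labels.
repeated = stacked.index.repeat(stacked['count'].to_numpy())
result = (stacked.loc[repeated, ['location_id', 'score', 'sex']]
          .reset_index(drop=True))
print(result)
```

Rows with a count of zero are repeated zero times, so they simply drop out of the result.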