DataFrame manipulation and merging in Python

Time: 2018-01-12 04:00:56

Tags: python pandas

I have a DataFrame, df1:

        inp   aco   drtn                
      2.3.6  dp   Less than 1 min         
      2.3.6  ft   5-10 min        
      2.5.9  dp   More than 1 hour        
      0.8.0  dp    1-5 min                
      2.3.6  dp   10-30 min               
      2.3.6  dp   More than 1 hour        
      0.8.0  dp   1-5 min                 
      6.2.6  dp   1-5 min                 

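df1 should be grouped by its three columns, counting the number of occurrences of each combination. The new DataFrame, df2, should look like this: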

       inp   aco   drtn                count
      2.3.6  dp   Less than 1 min         1
      2.3.6  ft   5-10 min                1
      2.5.9  dp   More than 1 hour        1
      0.8.0  dp    1-5 min                2
      2.3.6  dp   10-30 min               1
      2.3.6  dp   More than 1 hour        1
      6.2.6  dp   1-5 min                 1

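The column 'drtn' should be converted into a new column, 'convrt'. For example, the converted column would use: Less than 1 min = 0.59 min, More than 1 hour = 61 min, 1-5 min = 5 min, 5-10 min = 10 min, 10-30 min = 30 min. Another new column, 'calc', should be defined as the value in column 'count' multiplied by the value in column 'convrt'. The resulting DataFrame, df3, should look like this: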

       inp   aco   drtn                 count   convrt   calc
      2.3.6  dp   Less than 1 min         1     0.59     0.59
      2.3.6  ft   5-10 min                1      10      10
      2.5.9  dp   More than 1 hour        1      61      61
      0.8.0  dp    1-5 min                2       5      10
      2.3.6  dp   10-30 min               1      30      30
      2.3.6  dp   More than 1 hour        1      61      61
      6.2.6  dp   1-5 min                 1       5       5

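Next, a new DataFrame, df4, is obtained by filtering on the column 'aco'. For example, only rows with the value dp should be kept. Then a new column, 'pct', gives each row's percentage of the total of column 'calc':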

       inp   aco   drtn                 count   convrt   calc   pct
      2.3.6  dp   Less than 1 min         1     0.59     0.59   0.35
      2.5.9  dp   More than 1 hour        1      61      61     36.40
      0.8.0  dp   1-5 min                 2       5      10     5.97
      2.3.6  dp   10-30 min               1      30      30     17.90
      2.3.6  dp   More than 1 hour        1      61      61     36.40
      6.2.6  dp   1-5 min                 1       5       5     2.98

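Finally, a new DataFrame, df5, indexed by the filtered value of column 'aco', with three columns. 'pctTime' sums all values in column 'calc' of df3 and all values in column 'calc' of df4, then divides (df4 / df3) and expresses the result as a percentage. 'totalCalcFilteredColumn' holds the sum of column 'calc' in the filtered DataFrame, df4. 'diff' sums all values in column 'calc' of df3 and all values in column 'calc' of df4, then subtracts (df3 - df4):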

        pctTime      totalCalcFilteredColumn     diff    
  dp    94.37         167.59                     10        
  ft    5.63           10                       167.59    

How can I solve this?

1 Answer:

Answer 0: (score: 1)

Setup:

import pandas as pd
from io import StringIO

temp = """inp;aco;drtn
2.3.6;dp;Less than 1 min
2.3.6;ft;5-10 min
2.5.9;dp;More than 1 hour
0.8.0;dp;1-5 min
2.3.6;dp;10-30 min
2.3.6;dp;More than 1 hour
0.8.0;dp;1-5 min
6.2.6;dp;1-5 min"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'

df = pd.read_csv(StringIO(temp), sep=";")
print(df)
     inp aco              drtn
0  2.3.6  dp   Less than 1 min
1  2.3.6  ft          5-10 min
2  2.5.9  dp  More than 1 hour
3  0.8.0  dp           1-5 min
4  2.3.6  dp         10-30 min
5  2.3.6  dp  More than 1 hour
6  0.8.0  dp           1-5 min
7  6.2.6  dp           1-5 min

Solution:

d = {'1-5 min': 5, '10-30 min': 30, '5-10 min': 10,
'Less than 1 min': 0.59, 'More than 1 hour': 61}

df = df.groupby(['inp', 'aco', 'drtn'], sort=False).size().reset_index(name='count')
# map each duration label to minutes using the dictionary
df['convrt'] = df['drtn'].map(d)
df['calc'] = df['convrt'].mul(df['count'])
# divide by the per-group total; transform returns a Series the same size as df
df['pct'] = df['calc'].div(df.groupby('aco')['calc'].transform('sum')).mul(100)
print(df)
     inp aco              drtn  count  convrt   calc         pct
0  2.3.6  dp   Less than 1 min      1    0.59   0.59    0.352050
1  2.3.6  ft          5-10 min      1   10.00  10.00  100.000000
2  2.5.9  dp  More than 1 hour      1   61.00  61.00   36.398353
3  0.8.0  dp           1-5 min      2    5.00  10.00    5.966943
4  2.3.6  dp         10-30 min      1   30.00  30.00   17.900829
5  2.3.6  dp  More than 1 hour      1   61.00  61.00   36.398353
6  6.2.6  dp           1-5 min      1    5.00   5.00    2.983472
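The question describes df4 as keeping only the rows where aco is dp before pct is computed, whereas the code above computes pct within every aco group at once. For the dp rows the two give the same numbers. A minimal sketch of the filtered variant follows, assuming the sample data from the setup; the names raw, df3 and df4 and the constant DURATION_MINUTES are illustrative and not part of the original answer.

```python
from io import StringIO

import pandas as pd

temp = """inp;aco;drtn
2.3.6;dp;Less than 1 min
2.3.6;ft;5-10 min
2.5.9;dp;More than 1 hour
0.8.0;dp;1-5 min
2.3.6;dp;10-30 min
2.3.6;dp;More than 1 hour
0.8.0;dp;1-5 min
6.2.6;dp;1-5 min"""

# Illustrative name; the values are the conversions given in the question.
DURATION_MINUTES = {'1-5 min': 5, '10-30 min': 30, '5-10 min': 10,
                    'Less than 1 min': 0.59, 'More than 1 hour': 61}

raw = pd.read_csv(StringIO(temp), sep=";")

# df3: one row per (inp, aco, drtn) combination with count, convrt and calc
df3 = (raw.groupby(['inp', 'aco', 'drtn'], sort=False)
          .size()
          .reset_index(name='count'))
df3['convrt'] = df3['drtn'].map(DURATION_MINUTES)
df3['calc'] = df3['convrt'] * df3['count']

# df4: keep only the dp rows; .copy() so the new column is added to an
# independent frame rather than a view of df3
df4 = df3[df3['aco'] == 'dp'].copy()
df4['pct'] = df4['calc'] / df4['calc'].sum() * 100
print(df4.round(2))
```

Filtering first is easier to read when only one aco value matters; the transform-based version scales better when every group needs its own percentages.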
# aggregate: sum calc for each aco value
df = df.groupby('aco')['calc'].sum().reset_index(name='totalCalcFilteredColumn')
summed = df['totalCalcFilteredColumn'].sum()
df['pctTime'] = df['totalCalcFilteredColumn'].div(summed).mul(100)
# rsub subtracts from the right: summed - df['totalCalcFilteredColumn']
df['diff'] = df['totalCalcFilteredColumn'].rsub(summed)
print(df)
  aco  totalCalcFilteredColumn    pctTime    diff
0  dp                   167.59  94.369052   10.00
1  ft                    10.00   5.630948  167.59
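The steps above reassign df at each stage, so the intermediate frames are lost. One way to keep the input untouched is to collect the same steps into a function. This is only a sketch under that assumption: summarize_by_aco, duration_minutes, sample and minutes are hypothetical names that do not appear in the original answer.

```python
import pandas as pd


def summarize_by_aco(df, duration_minutes):
    """Return one row per aco value with its total calc, share and remainder."""
    # count occurrences of each (inp, aco, drtn) combination
    counts = (df.groupby(['inp', 'aco', 'drtn'], sort=False)
                .size()
                .reset_index(name='count'))
    # minutes per label times number of occurrences
    counts['calc'] = counts['drtn'].map(duration_minutes) * counts['count']
    # total calc per aco value
    out = (counts.groupby('aco')['calc']
                 .sum()
                 .reset_index(name='totalCalcFilteredColumn'))
    total = out['totalCalcFilteredColumn'].sum()
    out['pctTime'] = out['totalCalcFilteredColumn'] / total * 100
    # what remains of the overall total once this aco value is removed
    out['diff'] = total - out['totalCalcFilteredColumn']
    return out


rows = [
    ('2.3.6', 'dp', 'Less than 1 min'),
    ('2.3.6', 'ft', '5-10 min'),
    ('2.5.9', 'dp', 'More than 1 hour'),
    ('0.8.0', 'dp', '1-5 min'),
    ('2.3.6', 'dp', '10-30 min'),
    ('2.3.6', 'dp', 'More than 1 hour'),
    ('0.8.0', 'dp', '1-5 min'),
    ('6.2.6', 'dp', '1-5 min'),
]
sample = pd.DataFrame(rows, columns=['inp', 'aco', 'drtn'])
minutes = {'1-5 min': 5, '10-30 min': 30, '5-10 min': 10,
           'Less than 1 min': 0.59, 'More than 1 hour': 61}

summary = summarize_by_aco(sample, minutes)
print(summary)
```

Note that a drtn label missing from the mapping becomes NaN after map, and sum() skips NaN by default, so such rows would silently drop out of the totals.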