Unstack and merge rows using a column value as the key

Time: 2017-03-29 08:07:54

Tags: python pandas mapping

My grep command produces the output shown below.

 grep -r GFD . | cut -d: -f2
out_GFD_994     NSE_FO_BHP_1703   -9425     6800       361.45      11900     359.96      5100      0.34%    6137085.0  -15.36
out_GFD_994     NSE_FO_BHP_1704   15651     -6800      360.38      6800      362.04      13600     26.66%   7374430.0  21.22 
out_GFD_994     NSE_FO_TLS_1703        -4996     2000       603.57      5000      602.68      3000      0.46%    4825900.0  -10.35
out_GFD_994     NSE_FO_TLS_1704        4480      -2000      605.71      3000      606.44      5000      29.62%   4849350.0  9.24  
out_GFD_994     NSE_FO_MQG_1703          -11717    -20000     148.64      50000     148.64      70000     0.46%    17837250.0  -6.57 
out_GFD_994     NSE_FO_MQG_1704          17213     20000      149.29      75000     149.39      55000     36.11%   19413500.0  8.87  
out_GFD_Part2                        NSE_FO_BHP_1703   -17597    -20000     0           0         39.25       20000     0.07%    785000.0  -224.17
out_GFD_Part2                        NSE_FO_BHP_1704   14481     20000      39.6        20000     0           0         1.38%    792000.0  182.84
out_GFD_Part2                        NSE_FO_TLS_1703         28312     1200       643.93      16800     645.52      15600     0.54%    20888220.0  13.55 
out_GFD_Part2                        NSE_FO_TLS_1704         -23813    -1200      647.91      16800     646.87      18000     34.11%   22528620.0  -10.57
out_GFD_Part2                        NSE_FO_MQG_1703   -133456    8800       1029.33     25300     1025.86     16500     0.55%    42968915.0  -31.06 unhedged
out_GFD_Part2                        NSE_FO_MQG_1704   141534    -7700      1031.26     33000     1033.85     40700     49.62%   76109605.0  18.60 

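I need to collapse/merge/unstack (whichever term fits) these rows using the value in the second column as the key, so that the output becomes the following (with the corresponding values summed):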

out_GFD_994     out_GFD_Part2    NSE_FO_BHP_1703   -9425-17597     6800-20000       361.45      11900     359.96      5100      0.34%    6137085.0  -15.36
out_GFD_994     out_GFD_Part2    NSE_FO_BHP_1704   15651+14481     -6800+20000      360.38      6800      362.04      13600     26.66%   7374430.0  21.22 
out_GFD_994     out_GFD_Part2    NSE_FO_TLS_1703        -4996+28312     2000+1200       603.57      5000      602.68      3000      0.46%    4825900.0  -10.35
out_GFD_994     out_GFD_Part2    NSE_FO_TLS_1704        4480-23813     -2000-1200      605.71      3000      606.44      5000      29.62%   4849350.0  9.24  
out_GFD_994     out_GFD_Part2    NSE_FO_MQG_1703          -11717-133456    -20000+8800     148.64      50000     148.64      70000     0.46%    17837250.0  -6.57 
out_GFD_994     out_GFD_Part2    NSE_FO_MQG_1704          17213+141534     20000-7700      149.29      75000     149.39      55000     36.11%   19413500.0  8.87  

                                       (only 2 columns shown in expected output format)

I can name the columns and load the data into a pandas DataFrame if that makes the problem easier to solve.
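Loading the whitespace-separated grep output could be sketched as follows. This is a minimal example, assuming every row has the same eleven fields; the two sample rows are copied from the output above, and reading from an in-memory string (rather than a file) is only for illustration.

```python
import io
import pandas as pd

# Two sample rows copied from the grep output above
raw = """\
out_GFD_994     NSE_FO_BHP_1703   -9425     6800       361.45      11900     359.96      5100      0.34%    6137085.0  -15.36
out_GFD_Part2   NSE_FO_BHP_1703   -17597    -20000     0           0         39.25       20000     0.07%    785000.0  -224.17
"""

# Split on runs of whitespace; header=None yields integer column labels 0..10
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)
print(df.shape)
print(df.dtypes)
```

The percentage column (label 8) stays a string column because of the trailing `%`, which is why it cannot be summed directly.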

Update 1:

For now I am working with only 5 columns, which I have loaded into a pandas DataFrame as shown below:

>>> df
      grep_string              key    val1   val2     val3
0     out_GFD_994  NSE_FO_BHP_1703   -9425   6800   361.45
1     out_GFD_994  NSE_FO_BHP_1704   15651  -6800   360.38
2     out_GFD_994  NSE_FO_TLS_1703   -4996   2000   603.57
3     out_GFD_994  NSE_FO_TLS_1704    4480  -2000   605.71
4     out_GFD_994  NSE_FO_MQG_1703  -11717 -20000   148.64
5     out_GFD_994  NSE_FO_MQG_1704   17213  20000   149.29
6   out_GFD_Part2  NSE_FO_BHP_1703  -17597 -20000     0.00
7   out_GFD_Part2  NSE_FO_BHP_1704   14481  20000    39.60
8   out_GFD_Part2  NSE_FO_TLS_1703   28312   1200   643.93
9   out_GFD_Part2  NSE_FO_TLS_1704  -23813  -1200   647.91
10  out_GFD_Part2  NSE_FO_MQG_1703 -133456   8800  1029.33
11  out_GFD_Part2  NSE_FO_MQG_1704  141534  -7700  1031.26

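How do I merge the rows (by summing) using the key column?

Update 2:

The summed column values need to be applied to a log file that looks like this: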

NSE_FO_BHP_1703_MAXLONGPOS = 200000
NSE_FO_BHP_1703_MAXSHORTPOS = 200000
NSE_FO_BHP_1703_MAXLONGEXPOSURE = 250000
NSE_FO_BHP_1703_MAXSHORTEXPOSURE = 250000
NSE_FO_BHP_1704_MAXLONGPOS = 200000
NSE_FO_BHP_1704_MAXSHORTPOS = 200000
NSE_FO_BHP_1704_MAXLONGEXPOSURE = 250000
NSE_FO_BHP_1704_MAXSHORTEXPOSURE = 250000
NSE_FO_TLS_1703_MAXLONGPOS = 100000
NSE_FO_TLS_1703_MAXSHORTPOS = 100000
NSE_FO_TLS_1703_MAXLONGEXPOSURE = 200000
NSE_FO_TLS_1703_MAXSHORTEXPOSURE = 200000
NSE_FO_TLS_1704_MAXLONGPOS = 100000
NSE_FO_TLS_1704_MAXSHORTPOS = 100000
NSE_FO_TLS_1704_MAXLONGEXPOSURE = 200000
NSE_FO_TLS_1704_MAXSHORTEXPOSURE = 200000
NSE_FO_MQG_1703_MAXLONGPOS = 300000
NSE_FO_MQG_1703_MAXSHORTPOS = 300000
NSE_FO_MQG_1703_MAXLONGEXPOSURE = 400000
NSE_FO_MQG_1703_MAXSHORTEXPOSURE = 400000
NSE_FO_MQG_1704_MAXLONGPOS = 300000
NSE_FO_MQG_1704_MAXSHORTPOS = 300000
NSE_FO_MQG_1704_MAXLONGEXPOSURE = 400000
NSE_FO_MQG_1704_MAXSHORTEXPOSURE = 400000

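Can we take the summed values obtained in df (say column d), map them to the matching key substring, and add them to or subtract them from the values in the file above? For example, suppose column d gives -13200 and the file has NSE_FO_BHP_1703_MAXLONGPOS = 200000. In the file, that value should become 213200, and NSE_FO_BHP_1703_MAXSHORTPOS should become 186800. Likewise, MAXLONGEXPOSURE and MAXSHORTEXPOSURE should become 263200 and 236800.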
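The arithmetic in that example can be sketched as below. The rule is an assumption inferred from the single example given: MAXLONG* values become old value minus d, and MAXSHORT* values become old value plus d. The function name `adjust_limits` and the `sums` dictionary are hypothetical, introduced only for illustration.

```python
import re

# Matches lines such as "NSE_FO_BHP_1703_MAXLONGPOS = 200000"
LIMIT_RE = re.compile(r"^(\w+?)_(MAXLONG|MAXSHORT)(POS|EXPOSURE)\s*=\s*(-?\d+)\s*$")

def adjust_limits(lines, sums):
    """Return the config lines with each limit shifted by the summed value of its key.

    Assumed rule (inferred from the one example in the question):
    MAXLONG*  -> old value - d
    MAXSHORT* -> old value + d
    """
    out = []
    for line in lines:
        m = LIMIT_RE.match(line)
        if m and m.group(1) in sums:
            key, side, kind = m.group(1), m.group(2), m.group(3)
            value = int(m.group(4))
            d = sums[key]
            new_value = value - d if side == "MAXLONG" else value + d
            out.append(f"{key}_{side}{kind} = {new_value}")
        else:
            # Leave lines without a matching key untouched
            out.append(line)
    return out

lines = [
    "NSE_FO_BHP_1703_MAXLONGPOS = 200000",
    "NSE_FO_BHP_1703_MAXSHORTPOS = 200000",
    "NSE_FO_BHP_1703_MAXLONGEXPOSURE = 250000",
    "NSE_FO_BHP_1703_MAXSHORTEXPOSURE = 250000",
]
sums = {"NSE_FO_BHP_1703": -13200}  # key -> summed value of column d
result = adjust_limits(lines, sums)
print("\n".join(result))
```

Once the rows have been summed per key, a dictionary like `sums` could be built from the key column and column d, for example with `dict(zip(keys, values))`.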

1 Answer:

Answer 0: (score: 1)

You can use groupby with agg, passing a dictionary built by a dict comprehension. Finally, split the joined first column to create 2 new columns.

print (df)
                0                1       2      3        4      5        6  \
0     out_GFD_994  NSE_FO_BHP_1703   -9425   6800   361.45  11900   359.96   
1     out_GFD_994  NSE_FO_BHP_1704   15651  -6800   360.38   6800   362.04   
2     out_GFD_994  NSE_FO_TLS_1703   -4996   2000   603.57   5000   602.68   
3     out_GFD_994  NSE_FO_TLS_1704    4480  -2000   605.71   3000   606.44   
4     out_GFD_994  NSE_FO_MQG_1703  -11717 -20000   148.64  50000   148.64   
5     out_GFD_994  NSE_FO_MQG_1704   17213  20000   149.29  75000   149.39   
6   out_GFD_Part2  NSE_FO_BHP_1703  -17597 -20000     0.00      0    39.25   
7   out_GFD_Part2  NSE_FO_BHP_1704   14481  20000    39.60  20000     0.00   
8   out_GFD_Part2  NSE_FO_TLS_1703   28312   1200   643.93  16800   645.52   
9   out_GFD_Part2  NSE_FO_TLS_1704  -23813  -1200   647.91  16800   646.87   
10  out_GFD_Part2  NSE_FO_MQG_1703 -133456   8800  1029.33  25300  1025.86   
11  out_GFD_Part2  NSE_FO_MQG_1704  141534  -7700  1031.26  33000  1033.85   

        7       8           9      10  
0    5100   0.34%   6137085.0  -15.36  
1   13600  26.66%   7374430.0   21.22  
2    3000   0.46%   4825900.0  -10.35  
3    5000  29.62%   4849350.0    9.24  
4   70000   0.46%  17837250.0   -6.57  
5   55000  36.11%  19413500.0    8.87  
6   20000   0.07%    785000.0 -224.17  
7       0   1.38%    792000.0  182.84  
8   15600   0.54%  20888220.0   13.55  
9   18000  34.11%  22528620.0  -10.57  
10  16500   0.55%  42968915.0  -31.06  
11  40700  49.62%  76109605.0   18.60  

import numpy as np
import pandas as pd

#sum every column except the first (0), the second (1) and the percentage column (8)
d = {x:'sum' for x in df if x not in [0,1,8]}
#join the first-column strings within each group with '|'
d.update({0:'|'.join})
print (d)
{0: <built-in method join of str object at 0x0000000001180AE8>, 2: 'sum', 
 3: 'sum', 4: 'sum', 5: 'sum', 6: 'sum', 7: 'sum', 9: 'sum', 10: 'sum'}

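#group by the key column (1) and apply the per-column aggregations
df = df.groupby(1).agg(d).reset_index()
#split the joined strings back into 2 new columns labelled -2 and -1
df[[-2,-1]] = df.pop(0).str.split('|', expand=True)
#sort the column labels so that -2 and -1 come first
df = df.sort_index(axis=1)
#reset column names to default (0,1...)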
df.columns = np.arange(len(df.columns))
print (df)
            0              1                2       3      4        5   \
0  out_GFD_994  out_GFD_Part2  NSE_FO_BHP_1703  -27022 -13200   361.45   
1  out_GFD_994  out_GFD_Part2  NSE_FO_BHP_1704   30132  13200   399.98   
2  out_GFD_994  out_GFD_Part2  NSE_FO_MQG_1703 -145173 -11200  1177.97   
3  out_GFD_994  out_GFD_Part2  NSE_FO_MQG_1704  158747  12300  1180.55   
4  out_GFD_994  out_GFD_Part2  NSE_FO_TLS_1703   23316   3200  1247.50   
5  out_GFD_994  out_GFD_Part2  NSE_FO_TLS_1704  -19333  -3200  1253.62   

       6        7      8           9       10  
0   11900   399.21  25100   6922085.0 -239.53  
1   26800   362.04  13600   8166430.0  204.06  
2   75300  1174.50  86500  60806165.0  -37.63  
3  108000  1183.24  95700  95523105.0   27.47  
4   21800  1248.20  18600  25714120.0    3.20  
5   19800  1253.31  23000  27377970.0   -1.33  
#start again from the original (unaggregated) df and assign custom column names
df.columns = list('abcdefghijk')
print (df)
                a                b       c      d        e      f        g  \
0     out_GFD_994  NSE_FO_BHP_1703   -9425   6800   361.45  11900   359.96   
1     out_GFD_994  NSE_FO_BHP_1704   15651  -6800   360.38   6800   362.04   
2     out_GFD_994  NSE_FO_TLS_1703   -4996   2000   603.57   5000   602.68   
3     out_GFD_994  NSE_FO_TLS_1704    4480  -2000   605.71   3000   606.44   
4     out_GFD_994  NSE_FO_MQG_1703  -11717 -20000   148.64  50000   148.64   
5     out_GFD_994  NSE_FO_MQG_1704   17213  20000   149.29  75000   149.39   
6   out_GFD_Part2  NSE_FO_BHP_1703  -17597 -20000     0.00      0    39.25   
7   out_GFD_Part2  NSE_FO_BHP_1704   14481  20000    39.60  20000     0.00   
8   out_GFD_Part2  NSE_FO_TLS_1703   28312   1200   643.93  16800   645.52   
9   out_GFD_Part2  NSE_FO_TLS_1704  -23813  -1200   647.91  16800   646.87   
10  out_GFD_Part2  NSE_FO_MQG_1703 -133456   8800  1029.33  25300  1025.86   
11  out_GFD_Part2  NSE_FO_MQG_1704  141534  -7700  1031.26  33000  1033.85   

        h       i           j       k  
0    5100   0.34%   6137085.0  -15.36  
1   13600  26.66%   7374430.0   21.22  
2    3000   0.46%   4825900.0  -10.35  
3    5000  29.62%   4849350.0    9.24  
4   70000   0.46%  17837250.0   -6.57  
5   55000  36.11%  19413500.0    8.87  
6   20000   0.07%    785000.0 -224.17  
7       0   1.38%    792000.0  182.84  
8   15600   0.54%  20888220.0   13.55  
9   18000  34.11%  22528620.0  -10.57  
10  16500   0.55%  42968915.0  -31.06  
11  40700  49.62%  76109605.0   18.60

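Solution using custom column names:

#sum every column except 'a', 'b' and the percentage column 'i'
d = {x:'sum' for x in df if x not in ['a','b','i']}
#join the first-column strings within each group with '|'
d.update({'a':'|'.join})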
print (d)
{'e': 'sum', 'k': 'sum', 'a': <built-in method join of str object at 0x0000000001180AE8>, 
'f': 'sum', 'd': 'sum', 'g': 'sum', 'j': 'sum', 'c': 'sum', 'h': 'sum'}
df = df.groupby('b').agg(d).reset_index()
df1 = df.pop('a').str.split('|', expand=True)
df1.columns = ['out_' + str(x) for x in df1.columns]
df = pd.concat([df1, df],axis=1)
print (df)
         out_0          out_1                b        e       k       f  \
0  out_GFD_994  out_GFD_Part2  NSE_FO_BHP_1703   361.45 -239.53   11900   
1  out_GFD_994  out_GFD_Part2  NSE_FO_BHP_1704   399.98  204.06   26800   
2  out_GFD_994  out_GFD_Part2  NSE_FO_MQG_1703  1177.97  -37.63   75300   
3  out_GFD_994  out_GFD_Part2  NSE_FO_MQG_1704  1180.55   27.47  108000   
4  out_GFD_994  out_GFD_Part2  NSE_FO_TLS_1703  1247.50    3.20   21800   
5  out_GFD_994  out_GFD_Part2  NSE_FO_TLS_1704  1253.62   -1.33   19800   

       d        g           j       c      h  
0 -13200   399.21   6922085.0  -27022  25100  
1  13200   362.04   8166430.0   30132  13600  
2 -11200  1174.50  60806165.0 -145173  86500  
3  12300  1183.24  95523105.0  158747  95700  
4   3200  1248.20  25714120.0   23316  18600  
5  -3200  1253.31  27377970.0  -19333  23000  
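
For the five-column frame from Update 1, the same idea could be sketched with named aggregation, which spells out each output column and keeps the column order fixed. This is a minimal sketch rather than part of the solution above: it assumes pandas 0.25 or later, reuses the column names from Update 1, and includes only the four BHP rows for brevity.

```python
import pandas as pd

# Subset of the Update 1 frame (BHP rows only)
df = pd.DataFrame({
    "grep_string": ["out_GFD_994", "out_GFD_994", "out_GFD_Part2", "out_GFD_Part2"],
    "key": ["NSE_FO_BHP_1703", "NSE_FO_BHP_1704", "NSE_FO_BHP_1703", "NSE_FO_BHP_1704"],
    "val1": [-9425, 15651, -17597, 14481],
    "val2": [6800, -6800, -20000, 20000],
    "val3": [361.45, 360.38, 0.00, 39.60],
})

# Sum the numeric columns per key and join the source names with '|'
merged = df.groupby("key", as_index=False).agg(
    grep_string=("grep_string", "|".join),
    val1=("val1", "sum"),
    val2=("val2", "sum"),
    val3=("val3", "sum"),
)

# Split the joined names back into one column per source
sources = merged.pop("grep_string").str.split("|", expand=True)
sources.columns = ["out_" + str(x) for x in sources.columns]
merged = pd.concat([sources, merged], axis=1)
print(merged)
```

Joining with '|' and splitting afterwards relies on the source names never containing that character.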