pandas: merge DataFrames and pivot to create new columns

Time: 2015-07-14 22:24:22

Tags: python pandas pivot-table tsv

I have two input DataFrames.

df1 (note that this DataFrame may contain more data columns)

   Sample Animal  Time     Sex
0       1      A   one    male
1       2      A   two    male
2       3      B   one  female
3       4      C   one    male
4       5      D   one  female

df2

          a    b    c
Sample               
1       0.2  0.4  0.3
2       0.5  0.7  0.2
3       0.4  0.1  0.9
4       0.4  0.2  0.3
5       0.6  0.2  0.4

I would like to combine them so that I get the following:

        one_a  one_b  one_c  two_a  two_b  two_c     Sex
Animal                                                  
A         0.2    0.4    0.3    0.5    0.7    0.2    male
B         0.4    0.1    0.9    NaN    NaN    NaN  female
C         0.4    0.2    0.3    NaN    NaN    NaN    male
D         0.6    0.2    0.4    NaN    NaN    NaN  female

My current approach works, but it can be slow for large datasets. Is there a better (read: faster, more efficient) way to do this? I am new to pandas and can imagine there are shortcuts here that I do not know about.

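1 Answer:

Answer 0: (score: 1)

There are a few steps here. The key point is that, to create columns such as one_a, one_b, ..., two_c, we need to add the Time column to the Sample index to build a MultiIndex, and then unstack it to get the desired column layout. After that, a groupby on the Animal index level aggregates the rows and reduces the number of NaNs. The rest is just formatting.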

import pandas as pd

# your data
# ==============================
# set index
df1 = df1.set_index('Sample')

print(df1)

       Animal Time     Sex
Sample                    
1           A  one    male
2           A  two    male
3           B  one  female
4           C  one    male
5           D  one  female

print(df2)


          a    b    c
Sample               
1       0.2  0.4  0.3
2       0.5  0.7  0.2
3       0.4  0.1  0.9
4       0.4  0.2  0.3
5       0.6  0.2  0.4



# processing
# =============================
df = df1.join(df2)

df_temp = df.set_index(['Animal', 'Sex','Time'], append=True).unstack()

print(df_temp)


                        a         b         c     
Time                  one  two  one  two  one  two
Sample Animal Sex                                 
1      A      male    0.2  NaN  0.4  NaN  0.3  NaN
2      A      male    NaN  0.5  NaN  0.7  NaN  0.2
3      B      female  0.4  NaN  0.1  NaN  0.9  NaN
4      C      male    0.4  NaN  0.2  NaN  0.3  NaN
5      D      female  0.6  NaN  0.2  NaN  0.4  NaN

# flatten the (value, Time) column MultiIndex into names such as one_a, if you wish
df_temp.columns = ['{}_{}'.format(time, value) for value, time in df_temp.columns]

print(df_temp)

                      one_a  two_a  one_b  two_b  one_c  two_c
Sample Animal Sex                                             
1      A      male      0.2    NaN    0.4    NaN    0.3    NaN
2      A      male      NaN    0.5    NaN    0.7    NaN    0.2
3      B      female    0.4    NaN    0.1    NaN    0.9    NaN
4      C      male      0.4    NaN    0.2    NaN    0.3    NaN
5      D      female    0.6    NaN    0.2    NaN    0.4    NaN


# collapse the rows of each Animal; max() skips NaN, so the single non-NaN value per column survives
result = df_temp.reset_index('Sex').groupby(level='Animal').max().sort_index(axis=1)

print(result)

           Sex  one_a  one_b  one_c  two_a  two_b  two_c
Animal                                                  
A         male    0.2    0.4    0.3    0.5    0.7    0.2
B       female    0.4    0.1    0.9    NaN    NaN    NaN
C         male    0.4    0.2    0.3    NaN    NaN    NaN
D       female    0.6    0.2    0.4    NaN    NaN    NaN
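
For comparison, the same wide table can also be reached with merge followed by pivot instead of set_index/unstack. The following is only a minimal sketch, not the method used in the answer above. It assumes that each (Animal, Time) pair occurs at most once (pivot raises an error otherwise) and that Sex is constant within an Animal. The helper name `widen` is hypothetical, and the data is the sample from the question. Whether it is faster on a large dataset would need to be measured.

```python
import pandas as pd

# Sample data reconstructed from the tables in the question
df1 = pd.DataFrame({
    "Sample": [1, 2, 3, 4, 5],
    "Animal": ["A", "A", "B", "C", "D"],
    "Time": ["one", "two", "one", "one", "one"],
    "Sex": ["male", "male", "female", "male", "female"],
})
df2 = pd.DataFrame(
    {
        "a": [0.2, 0.5, 0.4, 0.4, 0.6],
        "b": [0.4, 0.7, 0.1, 0.2, 0.2],
        "c": [0.3, 0.2, 0.9, 0.3, 0.4],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Sample"),
)


def widen(meta, values):
    """Hypothetical helper: one row per Animal, one column per Time/value pair."""
    # attach the measurements to the metadata via the Sample key
    merged = meta.merge(values, left_on="Sample", right_index=True)
    # spread Time across the columns; the result has a (value, Time) column MultiIndex
    wide = merged.pivot(index="Animal", columns="Time", values=list(values.columns))
    # flatten the MultiIndex into names such as one_a
    wide.columns = [f"{time}_{value}" for value, time in wide.columns]
    # assumption: Sex does not vary within an Animal, so the first value is enough
    sex = merged.groupby("Animal")["Sex"].first()
    return wide.sort_index(axis=1).join(sex)


result = widen(df1, df2)
print(result)
```

Because pivot refuses duplicate (Animal, Time) pairs, a violated assumption shows up as an error rather than as silently aggregated values.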