Converting repeated columns into rows

Time: 2013-05-16 15:41:27

Tags: python pandas

This is my input file, as read from a CSV file:

Sample Info     D3S1358 1       D3S1358 2       TH01 1      TH01 2      D21S11 1        D21S11 2        D21S11 3
TEST_646            17          17                  9       9.3         28                  28          nan
TEST_647            18          18                  7       7           29                  30          30.2
TEST_648            16          16                  9       9           31.2                31.2        nan

I would like to convert it into this form:

Sample_name  Marker     mrk     value
TEST_646     D3S1358     1      17
TEST_646     D3S1358     2      17
TEST_646     TH01        1      9
TEST_646     TH01        2      9.3
TEST_646     D21S11      1      28.0
TEST_646     D21S11      2      28.0
TEST_646     D21S11      3      nan

PS: here are the values in comma-separated form, for your convenience:

Sample Info, D3S1358 1, D3S1358 2, TH01 1, TH01 2, D21S11 1, D21S11 2, D21S11 3
TEST_646, 17, 17, 9, 9.3, 28, 28, nan
TEST_647, 18, 18, 7, 7, 29, 30, 30.2
TEST_648, 16, 16, 9, 9, 31.2, 31.2, nan

My solution so far is:

samples = xls.parse(sheet).set_index('Sample Info')
cols = list(set(filter(None, [i[:-2] if i!="Sample Info" else None for i in samples.columns])))
sample_df_d= {'1' : pd.Series( len(cols)*[''], index=cols), '2' : pd.Series( len(cols)*[''], index=cols), '3' : pd.Series( len(cols)*[''], index=cols)}
sample_df_ = pd.DataFrame(sample_df_d)
sample_ser = sample_df_.stack()
sample_df = pd.DataFrame(sample_ser, columns=['value'])
#print sample_df

for i, j in samples.iterrows():
    for i2, j2 in j.items():
        # j.iloc[0] is the row's first value, not the sample name (that is i)
        print(j.iloc[0], i2[:-2], "\t", i2[-2:], "\t", j2)

which produces something like this:

17 D3S1358   1  17
17 D3S1358   2  17
17 TH01      1  9
17 TH01      2  9.3
17 D21S11    1  28.0

1 answer:

Answer 0 (score: 5)

Here is a way to do it with stack. First, clean up the columns into a MultiIndex:

In [11]: df_1 = df0.set_index('Sample Info')

In [12]: df_1.columns = pd.MultiIndex.from_arrays(zip(*df_1.columns.map(str.split)),
                                                  names=['Marker', 'mrk'])

In [13]: df_1
Out[13]:
Marker       D3S1358      TH01       D21S11
mrk                1   2     1    2       1     2     3
Sample Info
TEST_646          17  17     9  9.3    28.0  28.0   NaN
TEST_647          18  18     7  7.0    29.0  30.0  30.2
TEST_648          16  16     9  9.0    31.2  31.2   NaN
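
The steps above assume that df0 already holds the table from the question. As a minimal, self-contained sketch of the same idea, the comma-separated values from the question can be read from a string and the column names split into two levels. The names CSV_TEXT and wide are made up for this illustration, and from_tuples with a list comprehension is used here in place of the from_arrays/zip call above.

```python
import io

import pandas as pd

# The comma-separated data from the question, inlined so the example runs as-is.
CSV_TEXT = """Sample Info, D3S1358 1, D3S1358 2, TH01 1, TH01 2, D21S11 1, D21S11 2, D21S11 3
TEST_646, 17, 17, 9, 9.3, 28, 28, nan
TEST_647, 18, 18, 7, 7, 29, 30, 30.2
TEST_648, 16, 16, 9, 9, 31.2, 31.2, nan
"""

# skipinitialspace drops the blank after each comma; "nan" is parsed as missing.
df0 = pd.read_csv(io.StringIO(CSV_TEXT), skipinitialspace=True)
wide = df0.set_index("Sample Info")

# "D3S1358 1" -> ("D3S1358", "1"): one tuple per column, one level per part.
wide.columns = pd.MultiIndex.from_tuples(
    [tuple(col.split()) for col in wide.columns],
    names=["Marker", "mrk"],
)
print(wide)
```

Note that the mrk labels come out as strings ("1", "2", "3"), because they are cut out of the original column names.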

Then you can stack (first 'Marker', then 'mrk'):

In [14]: df_2 = df_1.stack(level=['Marker', 'mrk'])

In [15]: df_2
Out[15]:
Sample Info  Marker   mrk
TEST_646     D21S11   1      28.0
                      2      28.0
             D3S1358  1      17.0
                      2      17.0
             TH01     1       9.0
                      2       9.3
TEST_647     D21S11   1      29.0
                      2      30.0
                      3      30.2
             D3S1358  1      18.0
                      2      18.0
             TH01     1       7.0
                      2       7.0
TEST_648     D21S11   1      31.2
                      2      31.2
             D3S1358  1      16.0
                      2      16.0
             TH01     1       9.0
                      2       9.0
dtype: float64
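
The output above has no row for TEST_646 / D21S11 / 3, because the missing values were dropped by stack, while the desired output in the question lists that row with nan. If those rows matter, one option is melt, which is a different technique from the stack approach in this answer. The following is only a rough sketch: it uses two samples and the D21S11 columns from the question, and the names tidy and column are made up for illustration.

```python
import pandas as pd

# Two samples and the D21S11 columns from the question, typed in by hand.
df0 = pd.DataFrame({
    "Sample Info": ["TEST_646", "TEST_647"],
    "D21S11 1": [28.0, 29.0],
    "D21S11 2": [28.0, 30.0],
    "D21S11 3": [float("nan"), 30.2],
})

# melt keeps every (row, column) cell, including the missing ones.
tidy = df0.melt(id_vars="Sample Info", var_name="column", value_name="value")

# Split "D21S11 3" into the marker name and the repeat number.
tidy[["Marker", "mrk"]] = tidy["column"].str.split(expand=True)

# A stable sort keeps the original column order within each sample.
tidy = tidy.rename(columns={"Sample Info": "Sample_name"})
tidy = tidy.sort_values("Sample_name", kind="mergesort").reset_index(drop=True)
tidy = tidy[["Sample_name", "Marker", "mrk", "value"]]
print(tidy)
```

The row for TEST_646 / D21S11 / 3 is kept, with NaN as its value.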

If you want these back as columns, you can use reset_index:

df_2.reset_index()
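
With a plain reset_index() the column holding the stacked values is simply named 0. To get the column names asked for in the question, the name argument of Series.reset_index and a rename can be used. A small sketch, assuming a single sample from the question with the columns already split into two levels; result is a hypothetical name.

```python
import pandas as pd

# One sample from the question, with the columns already split into two levels.
columns = pd.MultiIndex.from_tuples(
    [("D3S1358", "1"), ("D3S1358", "2"), ("TH01", "1"), ("TH01", "2")],
    names=["Marker", "mrk"],
)
df_1 = pd.DataFrame(
    [[17, 17, 9, 9.3]],
    index=pd.Index(["TEST_646"], name="Sample Info"),
    columns=columns,
)

df_2 = df_1.stack(level=["Marker", "mrk"])

# name= labels the column that holds the stacked values.
result = df_2.reset_index(name="value")
result = result.rename(columns={"Sample Info": "Sample_name"})
print(result)
```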