How to split string from column to create long format dataframe

Time: 2015-06-15 14:29:00

Tags: python pandas dataframe

If I have the dataframe shown below, how do I make a long-format dataframe (i.e. one term per gene per row)?

I guess I will have to apply or map a split(",") over the Terms column, but what do I do after that?

import pandas as pd
from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

# Whitespace-separated table: one gene per row, comma-separated GO terms in Terms
df = pd.read_table(StringIO("""Gene    Terms
Mt-nd1  GO:0005739,GO:0005743,GO:0016021,GO:0030425,GO:0043025,GO:0070469,GO:0005623,GO:0005622,GO:0005737
Madd    GO:0016021,GO:0045202,GO:0005886
Zmiz1   GO:0005654,GO:0043231
Cdca7   GO:0005622,GO:0005623,GO:0005737,GO:0005634,GO:0005654"""), sep=r"\s+")

P.S. The table above is simplified; the actual df will have many more columns.

P.P.S. In case I was unclear, I want to end up with something like:

Mt-nd1  GO:0005739
Mt-nd1  GO:0005743
Mt-nd1  GO:0016021
...
Cdca7   GO:0005634
Cdca7   GO:0005654

1 answer:

Answer 0 (score: 4)

You can use str.split to do the splitting (instead of the apply-and-split approach, though the idea is similar):

In [6]: splitted = df['Terms'].str.split(',', expand=True)

In [7]: splitted 
Out[7]:
            0           1           2           3           4           5  \
0  GO:0005739  GO:0005743  GO:0016021  GO:0030425  GO:0043025  GO:0070469
1  GO:0016021  GO:0045202  GO:0005886         NaN         NaN         NaN
2  GO:0005654  GO:0043231         NaN         NaN         NaN         NaN
3  GO:0005622  GO:0005623  GO:0005737  GO:0005634  GO:0005654         NaN

            6           7           8
0  GO:0005623  GO:0005622  GO:0005737
1         NaN         NaN         NaN
2         NaN         NaN         NaN
3         NaN         NaN         NaN

To get separate columns (instead of a single column of lists), pass the expand=True keyword to str.split; on older pandas versions, df['Terms'].str.split(',').apply(pd.Series) gives the same result.
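As a small sketch of that equivalence (using only two rows of the question's table, built by hand here for brevity), both routes give one column per term, with missing entries where a gene has fewer terms:

```python
import pandas as pd

# Two rows taken from the question's table, constructed directly for brevity.
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})

# Route 1: let str.split expand the pieces into columns directly.
expanded = df["Terms"].str.split(",", expand=True)

# Route 2 (older pandas): split into lists, then turn each list into a row.
via_apply = df["Terms"].str.split(",").apply(pd.Series)

# Both frames have the same shape, and the same cells are missing
# (the second gene has only two terms, so its third column is empty).
print(expanded.shape, via_apply.shape)
print((expanded.isna().to_numpy() == via_apply.isna().to_numpy()).all())
```

The placeholder used for the missing cells may differ between the two routes depending on the pandas version, which is why the comparison is done on isna() rather than on the raw values.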

Now, to obtain your desired output we have to stack these columns, but first concatenate them with the Gene column so that this information is kept in the stacked frame:

In [14]: stacked = pd.concat([df['Gene'], splitted], axis=1).set_index('Gene').stack()
In [15]: stacked
Out[15]:
Gene
Mt-nd1  0    GO:0005739
        1    GO:0005743
        2    GO:0016021
        3    GO:0030425
        4    GO:0043025
        5    GO:0070469
        6    GO:0005623
        7    GO:0005622
        8    GO:0005737
Madd    0    GO:0016021
        1    GO:0045202
        2    GO:0005886
Zmiz1   0    GO:0005654
        1    GO:0043231
Cdca7   0    GO:0005622
        1    GO:0005623
        2    GO:0005737
        3    GO:0005634
        4    GO:0005654
dtype: object

From here, we can reset the index, rename the column holding the terms, and drop the integer column (level_1, which comes from the automatically generated column names) that we no longer need:

In [19]: stacked.reset_index().rename(columns={0:'Term'}).drop('level_1', axis=1)
Out[19]:
      Gene        Term
0   Mt-nd1  GO:0005739
1   Mt-nd1  GO:0005743
2   Mt-nd1  GO:0016021
3   Mt-nd1  GO:0030425
4   Mt-nd1  GO:0043025
5   Mt-nd1  GO:0070469
6   Mt-nd1  GO:0005623
7   Mt-nd1  GO:0005622
8   Mt-nd1  GO:0005737
9     Madd  GO:0016021
10    Madd  GO:0045202
11    Madd  GO:0005886
12   Zmiz1  GO:0005654
13   Zmiz1  GO:0043231
14   Cdca7  GO:0005622
15   Cdca7  GO:0005623
16   Cdca7  GO:0005737
17   Cdca7  GO:0005634
18   Cdca7  GO:0005654
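Putting the steps above together, a minimal self-contained sketch could look like this. The helper name to_long and its parameters are made up for illustration (it is not a pandas function), and only two rows of the question's table are used. The explicit dropna() is there because, depending on the pandas version, stack() may keep the missing cells instead of dropping them.

```python
import pandas as pd

# Two rows taken from the question's table, constructed directly for brevity.
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})


def to_long(frame, id_col="Gene", list_col="Terms", value_name="Term", sep=","):
    """Return one row per (id, term) pair. Hypothetical helper, not a pandas API."""
    # One column per term; shorter rows are padded with missing values.
    splitted = frame[list_col].str.split(sep, expand=True)
    stacked = (
        pd.concat([frame[id_col], splitted], axis=1)
        .set_index(id_col)
        .stack()
        .dropna()  # make sure the padding cells are gone, whatever stack() does
    )
    # Name the values, move the id back into a column, discard the position level.
    return (
        stacked.rename(value_name)
        .reset_index(level=id_col)
        .reset_index(drop=True)
    )


long_df = to_long(df)
print(long_df)
```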

How to combine or merge this with the other columns you have depends on what exactly you want to do with it.
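For one common case, where every other column should simply be repeated on each term row, here is a hedged sketch. It uses a different technique from the stack approach above, namely DataFrame.explode, which is available from pandas 0.25 onwards. The Extra column and its values are placeholders standing in for the "many more columns" mentioned in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Extra": ["a", "b"],  # hypothetical extra column, values made up
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})

long_df = (
    df.assign(Term=df["Terms"].str.split(","))  # each cell becomes a list of terms
    .explode("Term")                            # one row per list element
    .drop(columns="Terms")                      # the original string is no longer needed
    .reset_index(drop=True)
)
print(long_df)
```

Because explode repeats the values of all remaining columns, no separate merge step is needed in this case; if the other columns need aggregating or joining from elsewhere, a merge on Gene would be the more natural route.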