Given the dataframe shown below, how do I convert it to a long-format dataframe (i.e. one term per gene per row)?
I guess I will have to apply or map a split(",") to the Terms column, but what do I do after that?
import pandas as pd
from io import StringIO  # Python 3 location; the standalone StringIO module was Python 2 only
df = pd.read_table(StringIO("""Gene Terms
Mt-nd1 GO:0005739,GO:0005743,GO:0016021,GO:0030425,GO:0043025,GO:0070469,GO:0005623,GO:0005622,GO:0005737
Madd GO:0016021,GO:0045202,GO:0005886
Zmiz1 GO:0005654,GO:0043231
Cdca7 GO:0005622,GO:0005623,GO:0005737,GO:0005634,GO:0005654"""), sep=r"\s+")
PS: The table above is simplified; the actual df will have many more columns.
PPS: In case I was unclear, I want to end up with something like:
Mt-nd1 GO:0005739
Mt-nd1 GO:0005743
Mt-nd1 GO:0016021
...
Cdca7 GO:0005634
Cdca7 GO:0005654
Answer 0 (score: 4)
You can use str.split to do the splitting (instead of the apply-and-split approach, though the idea is similar):
In [6]: splitted = df['Terms'].str.split(',', expand=True)
In [7]: splitted
Out[7]:
0 1 2 3 4 5 \
0 GO:0005739 GO:0005743 GO:0016021 GO:0030425 GO:0043025 GO:0070469
1 GO:0016021 GO:0045202 GO:0005886 NaN NaN NaN
2 GO:0005654 GO:0043231 NaN NaN NaN NaN
3 GO:0005622 GO:0005623 GO:0005737 GO:0005634 GO:0005654 NaN
6 7 8
0 GO:0005623 GO:0005622 GO:0005737
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
To get separate columns (instead of a single column of lists), pass the expand=True keyword to str.split; for older pandas versions, df['Terms'].str.split(',').apply(pd.Series) gives the same result.
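As a quick check of the two spellings mentioned above, here is a minimal sketch on a two-row stand-in for the Terms column (the variable names `terms`, `wide_expand` and `wide_legacy` are only illustrative). It assumes a pandas version recent enough to support expand=True; the exact missing-value marker in the padded cells (NaN or None) may differ between versions.

```python
import pandas as pd

# Two rows taken from the question's data, with different numbers of terms.
terms = pd.Series([
    "GO:0016021,GO:0045202,GO:0005886",
    "GO:0005654,GO:0043231",
])

# Built-in expansion: one column per term, shorter rows padded with missing values.
wide_expand = terms.str.split(",", expand=True)

# Older spelling: split into lists, then turn each list into a row.
wide_legacy = terms.str.split(",").apply(pd.Series)

print(wide_expand)
print(wide_legacy)
```

Both frames should have the same shape and the same missing cells; the legacy form is usually slower on large frames because it builds one Series per row.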
Now, to obtain your desired output we have to stack these columns, but first we concatenate them with the Gene column so that this information is kept in the stacked result:
In [14]: stacked = pd.concat([df['Gene'], splitted],axis=1).set_index('Gene').stack()
In [15]: stacked
Out[15]:
Gene
Mt-nd1 0 GO:0005739
1 GO:0005743
2 GO:0016021
3 GO:0030425
4 GO:0043025
5 GO:0070469
6 GO:0005623
7 GO:0005622
8 GO:0005737
Madd 0 GO:0016021
1 GO:0045202
2 GO:0005886
Zmiz1 0 GO:0005654
1 GO:0043231
Cdca7 0 GO:0005622
1 GO:0005623
2 GO:0005737
3 GO:0005634
4 GO:0005654
dtype: object
From here, we can reset the index, rename the column holding the terms, and drop the integer column (left over from the automatically generated column names) that we no longer need:
In [19]: stacked.reset_index().rename(columns={0:'Term'}).drop('level_1', axis=1)
Out[19]:
Gene Term
0 Mt-nd1 GO:0005739
1 Mt-nd1 GO:0005743
2 Mt-nd1 GO:0016021
3 Mt-nd1 GO:0030425
4 Mt-nd1 GO:0043025
5 Mt-nd1 GO:0070469
6 Mt-nd1 GO:0005623
7 Mt-nd1 GO:0005622
8 Mt-nd1 GO:0005737
9 Madd GO:0016021
10 Madd GO:0045202
11 Madd GO:0005886
12 Zmiz1 GO:0005654
13 Zmiz1 GO:0043231
14 Cdca7 GO:0005622
15 Cdca7 GO:0005623
16 Cdca7 GO:0005737
17 Cdca7 GO:0005634
18 Cdca7 GO:0005654
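The steps above can be put together in one short script. This is a sketch on a two-gene stand-in for the question's frame, not the exact session shown. The explicit dropna() is an added assumption: depending on the pandas version, stack() may keep the padded missing cells rather than dropping them, so removing them explicitly should be harmless either way.

```python
import pandas as pd

# Small stand-in for the question's frame (two genes only).
df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})

# Step 1: one column per term.
splitted = df["Terms"].str.split(",", expand=True)

# Step 2: attach the gene names, index by them, and stack the term columns.
stacked = (
    pd.concat([df["Gene"], splitted], axis=1)
    .set_index("Gene")
    .stack()
    .dropna()  # assumption: guard for versions where stack() keeps missing cells
)

# Step 3: name the values, flatten the index, drop the leftover integer level.
long_df = (
    stacked.rename("Term")
    .reset_index()
    .drop(columns="level_1")
)

print(long_df)
```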
How to combine or merge this with the other columns you have depends on what exactly you want to do with it.
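If the goal is simply to repeat every other column for each term, one possible alternative (a different technique from the stack approach above, and only available in pandas 0.25 or later) is DataFrame.explode. The sketch below assumes a made-up extra column called Extra with placeholder values, standing in for the "many more columns" of the real frame.

```python
import pandas as pd

df = pd.DataFrame({
    "Gene": ["Madd", "Zmiz1"],
    "Extra": ["x", "y"],  # hypothetical extra column with placeholder values
    "Terms": ["GO:0016021,GO:0045202,GO:0005886", "GO:0005654,GO:0043231"],
})

long_df = (
    df.assign(Term=df["Terms"].str.split(","))  # each cell becomes a list of terms
    .drop(columns="Terms")
    .explode("Term")                            # one row per list element
    .reset_index(drop=True)                     # explode repeats the original index
)

print(long_df)
```

Every column other than Term is carried along unchanged, so no separate merge step is needed in this case.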