Normalizing pandas data into a one-to-many relationship

Time: 2015-09-05 13:39:52

Tags: pandas

My data frame has a column that stores comma-separated values.

from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

myst="""india | 905034 | 19:44 | cricket, hockey  
USA | 905094  | 19:33 | swimming, running, tennis, football
Russia |  905154 |   21:56 | basketball

"""
u_cols=['country', 'index', 'current_tm', 'sports']

import pandas as pd

# The input is pipe-separated with no header row, so names supplies the column labels
df = pd.read_csv(StringIO(myst), sep='|', names=u_cols)

Is it possible to split each cell into several rows, like this?

india cricket
india hockey
USA swimming
USA running
USA tennis
USA football
Russia basketball

1 Answer:

Answer 0: (score: 2)

You can use str.split, followed by apply(pd.Series).stack() (apply(pd.Series) spreads the split elements across separate columns, and stack turns those columns into rows):

In [34]: df = df.set_index('country')

In [36]: s = df['sports'].str.split(',').apply(pd.Series).stack()

In [37]: s
Out[37]:
country
india    0        cricket
         1       hockey
USA      0       swimming
         1        running
         2         tennis
         3       football
Russia   0     basketball
dtype: object
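
To make the two steps easier to follow, here is a minimal self-contained sketch that looks at the intermediate wide frame before stacking. The names raw, cols, wide and long are illustrative and not from the original post. The explicit dropna() is an assumption added so that the result does not depend on whether the installed pandas version drops the NaN padding inside stack().

```python
from io import StringIO
import pandas as pd

raw = """india | 905034 | 19:44 | cricket, hockey
USA | 905094 | 19:33 | swimming, running, tennis, football
Russia | 905154 | 21:56 | basketball
"""
cols = ['country', 'index', 'current_tm', 'sports']
df = pd.read_csv(StringIO(raw), sep='|', names=cols).set_index('country')

# One column per list position; rows with fewer sports are padded with NaN
wide = df['sports'].str.split(',').apply(pd.Series)
print(wide.shape)

# stack() moves the columns into an inner index level;
# dropna() removes any NaN padding that is still present
long = wide.stack().dropna()
print(len(long))
```

The wide frame has one row per country and one column per position in the longest list, which is why the padding has to be removed when going back to a long layout.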

Then clean it up further:

In [38]: s.reset_index(level=0).reset_index(drop=True)
Out[38]:
   country            0
0   india       cricket
1   india      hockey
2     USA      swimming
3     USA       running
4     USA        tennis
5     USA      football
6  Russia    basketball
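
The result above still has a column named 0, and because the input has spaces around the separators, the values carry stray whitespace. A possible extra cleanup step, as a sketch: the column name sport and the variable names are my own choices, and the dropna() call is an added safeguard rather than part of the answer.

```python
from io import StringIO
import pandas as pd

raw = """india | 905034 | 19:44 | cricket, hockey
USA | 905094 | 19:33 | swimming, running, tennis, football
Russia | 905154 | 21:56 | basketball
"""
cols = ['country', 'index', 'current_tm', 'sports']
df = pd.read_csv(StringIO(raw), sep='|', names=cols).set_index('country')

s = df['sports'].str.split(',').apply(pd.Series).stack().dropna()

tidy = (
    s.str.strip()              # remove the spaces left around each sport
     .rename('sport')          # give the value column a real name instead of 0
     .reset_index(level=0)     # move country from the index into a column
     .reset_index(drop=True)   # discard the per-country position counter
)
tidy['country'] = tidy['country'].str.strip()
print(tidy)
```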

Note that with recent versions of pandas you can replace .apply(pd.Series) with expand=True in str.split: df['sports'].str.split(',', expand=True).stack()
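
A runnable sketch of the expand=True variant follows, together with one alternative that is not part of the original answer: explode, which is available in pandas 0.25 and later and turns each list element into its own row directly. Variable names are illustrative, and dropna() is added so the stacked result does not depend on the stack() defaults of the installed version.

```python
from io import StringIO
import pandas as pd

raw = """india | 905034 | 19:44 | cricket, hockey
USA | 905094 | 19:33 | swimming, running, tennis, football
Russia | 905154 | 21:56 | basketball
"""
cols = ['country', 'index', 'current_tm', 'sports']
df = pd.read_csv(StringIO(raw), sep='|', names=cols)

# Variant from the note: expand=True returns the wide frame directly
stacked = (
    df.set_index('country')['sports']
      .str.split(',', expand=True)
      .stack()
      .dropna()
)

# Alternative (not from the answer): split into lists, then explode to rows
out = (
    df.assign(sport=df['sports'].str.split(','))
      .explode('sport')[['country', 'sport']]
)
# Strip the whitespace left by the ' | ' and ', ' separators in both columns
out = out.apply(lambda col: col.str.strip()).reset_index(drop=True)
print(out)
```

Both routes give one row per country and sport pair. The explode version keeps country as an ordinary column throughout, so no index juggling is needed afterwards.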