How do I split a single pandas DataFrame column into multiple columns?

Time: 2017-09-11 05:53:47

Tags: python pandas dataframe

I am new to Python pandas. I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'],
               'age': ['25', '22','21','32','37','26','24','30']})
print(df)

       Name age
0  football  25
1    ramesh  22
2    suresh  21
3    pankaj  32
4   cricket  37
5    rakesh  26
6     mohit  24
7    mahesh  30

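The 'Name' column contains both sport names and the names of the people who play them. I want to split it into two separate columns, like this:

Expected output: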

sports_name sport_person_name age
football    ramesh            22
            suresh            21
            pankaj            32
cricket     rakesh            26
            mohit             24
            mahesh            30

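If I group by the 'Name' column, I don't get the expected output, which is no surprise because the 'Name' column contains no duplicates. What do I need to use to get the expected output?

Edit: if you don't want to hard-code the sport names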

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['football', 'ramesh','suresh','pankaj','cricket','rakesh','mohit','mahesh'],
           'age': ['', '22','21','32','','26','24','30']})

# treat empty strings as missing values (exact match, so no regex is needed)
df = df.replace('', np.nan)

# rows with a missing value in any column are the sport names
nan_rows = df[df.isnull().any(axis=1)]
sports = nan_rows['Name'].tolist()

df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)

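I checked which rows contain NaN in every column other than 'Name'; such a row must be a sport name. I built a list of those sport names and then used the solution below to create the sports_name and sport_person_name columns.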
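The check described here could also be written as a boolean mask, which avoids the intermediate list and still works if a player happens to share a name with a sport. This is only a minimal sketch, assuming a blank age marks a sport row; the names is_sport and result are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['football', 'ramesh', 'suresh', 'pankaj',
             'cricket', 'rakesh', 'mohit', 'mahesh'],
    'age': ['', '22', '21', '32', '', '26', '24', '30'],
})

# Treat empty strings as missing values.
df = df.replace('', np.nan)

# Assumption: a row is a sport header when every column except Name is missing.
is_sport = df.drop(columns='Name').isna().all(axis=1)

# Keep the name on header rows only, then carry it down to the players.
df['sports_name'] = df['Name'].where(is_sport).ffill()

# Drop the header rows, rename, and put the columns in the wanted order.
result = df[~is_sport].rename(columns={'Name': 'sport_person_name'})
result = result.reset_index(drop=True)
result = result[['sports_name', 'sport_person_name', 'age']]
print(result)
```

Filtering with the mask itself, rather than comparing sports_name to Name, keeps the header test in one place.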

2 answers:

Answer 0: (score: 2)

You can use:

#define list of sports
sports = ['football','cricket']
#create NaNs if no sport in Name, forward filling NaNs
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill()
#remove same values in columns sports_name and Name, rename column
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
#change order of columns
df = df[['sports_name','sport_person_name','age']]
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30

A similar solution uses DataFrame.insert, which makes the reordering step unnecessary:

#define list of sports
sports = ['football','cricket']
#rename column by dict
d = {'Name':'sport_person_name'}
df = df.rename(columns=d)
#create NaNs if no sport in Name, forward filling NaNs
df.insert(0, 'sports_name', df['sport_person_name'].where(df['sport_person_name'].isin(sports)).ffill())
#remove same values in columns sports_name and Name
df = df[df['sports_name'] != df['sport_person_name']].reset_index(drop=True)
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1    football            suresh  21
2    football            pankaj  32
3     cricket            rakesh  26
4     cricket             mohit  24
5     cricket            mahesh  30

If you want each sport value to appear only once, add limit=1 to ffill and replace the NaNs with empty strings:

sports = ['football','cricket']
df['sports_name'] = df['Name'].where(df['Name'].isin(sports)).ffill(limit=1).fillna('')
d = {'Name':'sport_person_name'}
df = df[df['sports_name'] != df['Name']].reset_index(drop=True).rename(columns=d)
df = df[['sports_name','sport_person_name','age']]
print (df)
  sports_name sport_person_name age
0    football            ramesh  22
1                        suresh  21
2                        pankaj  32
3     cricket            rakesh  26
4                         mohit  24
5                        mahesh  30
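Once sports_name is filled on every row (as in the first two variants of this answer, not the limit=1 one), the groupby that the question attempted becomes meaningful, because the sport now repeats across its players. A minimal sketch, assuming the frame already has the three columns shown in the outputs here; players_per_sport is an illustrative name.

```python
import pandas as pd

# The reshaped frame, typed out directly so the example is self-contained.
df = pd.DataFrame({
    'sports_name': ['football'] * 3 + ['cricket'] * 3,
    'sport_person_name': ['ramesh', 'suresh', 'pankaj',
                          'rakesh', 'mohit', 'mahesh'],
    'age': ['22', '21', '32', '26', '24', '30'],
})

# sort=False keeps the sports in their original order of appearance.
for sport, group in df.groupby('sports_name', sort=False):
    print(sport, group['sport_person_name'].tolist())

# Number of players recorded for each sport.
players_per_sport = df.groupby('sports_name', sort=False).size()
print(players_per_sport)
```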

Answer 1: (score: 1)

The output you want is a dictionary rather than a DataFrame. The dictionary would look like:

{'Sport' : {'Player' : age,'Player2' : age}}

If you really want a DataFrame, and the sport name always appears before its players, it could look like this:

import pandas as pd

df = pd.DataFrame({'Name': ['football','ramesh','suresh','pankaj','cricket',
                            'rakesh','mohit','mahesh'],
                   'age': ['25', '22','21','32','37','26','24','30']})
# rename the column so it matches the name used below
df = df.rename(columns={'Name': 'sport_person_name'})

sports = ['football', 'cricket']
wanted_dict = {}
current_sport = ''

# walk down the column, remembering the most recent sport seen
for val in df['sport_person_name']:
    if val in sports:
        current_sport = val
    else:
        wanted_dict[val] = current_sport

# Now you have {name: sport_name, ...}

# look up each player's sport; the sport rows themselves get the marker 'sport'
df['sports_name'] = df['sport_person_name'].map(wanted_dict).fillna('sport')

# drop the sport rows
df = df[df['sports_name'] != 'sport']
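For the dictionary form this answer mentions ({'Sport' : {'Player' : age}}), a minimal sketch could walk the two columns together. It assumes the sport names are known in advance and that each sport row comes before its players; the variable name nested is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['football', 'ramesh', 'suresh', 'pankaj',
             'cricket', 'rakesh', 'mohit', 'mahesh'],
    'age': ['25', '22', '21', '32', '37', '26', '24', '30'],
})

# Assumption: the sport names are known up front.
sports = ['football', 'cricket']

nested = {}
current_sport = None
for name, age in zip(df['Name'], df['age']):
    if name in sports:
        # A sport row starts a new inner dictionary.
        current_sport = name
        nested[current_sport] = {}
    elif current_sport is not None:
        # A player row is filed under the most recent sport.
        nested[current_sport][name] = age

print(nested)
```

The ages stay as strings because that is how the sample data stores them; convert with int(age) if numbers are needed.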