Python Pandas: Split slash-separated strings in two or more columns into multiple rows

Time: 2016-08-01 23:09:13

Tags: python pandas dataframe

I have a pandas DataFrame like this:

SUBJECT                 STUDENT         CITY         STATE

Math/Chemistry/Biology  Sam/Peter/Mary  Los Angeles  CA
Geology/Physics         John            Boston       MA

The desired result looks like this:

SUBJECT      STUDENT    CITY           STATE

Math         Sam        Los Angeles    CA
Chemistry    Peter      Los Angeles    CA
Biology      Mary       Los Angeles    CA
Geology      John       Boston         MA
Physics      John       Boston         MA

Before asking this question, I looked at the solution given on this page: pandas: How do I split text in a column into multiple rows?

Because two of the columns contain slash-separated strings, I could not use the solution from that link.
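
For anyone who wants to reproduce the example, the input frame can be built as in the sketch below. The values are copied from the table in the question; the variable name `df` is the one the answers use.

```python
import pandas as pd

# Input frame from the question: SUBJECT and STUDENT hold '/'-separated values.
df = pd.DataFrame({
    "SUBJECT": ["Math/Chemistry/Biology", "Geology/Physics"],
    "STUDENT": ["Sam/Peter/Mary", "John"],
    "CITY": ["Los Angeles", "Boston"],
    "STATE": ["CA", "MA"],
})
print(df)
```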

3 Answers:

Answer 0 (score: 3)

Another solution, using concat and join:

s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()
print (s1)
0  0         Math
   1    Chemistry
   2      Biology
1  0      Geology
   1      Physics
dtype: object

print (s2)
0  0      Sam
   1    Peter
   2     Mary
1  0     John
dtype: object
df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT')) \
        .ffill() \
        .reset_index(level=1, drop=True)
print (df1)
     SUBJECT STUDENT
0       Math     Sam
0  Chemistry   Peter
0    Biology    Mary
1    Geology    John
1    Physics    John

df = df.drop(['SUBJECT','STUDENT'], axis=1) \
       .join(df1) \
       .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]
print (df)
     SUBJECT STUDENT         CITY STATE
0       Math     Sam  Los Angeles    CA
1  Chemistry   Peter  Los Angeles    CA
2    Biology    Mary  Los Angeles    CA
3    Geology    John       Boston    MA
4    Physics    John       Boston    MA
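
The ffill() step is easier to follow when the intermediate frame is printed before the fill. The sketch below rebuilds the question's data and stops right after concat, which aligns the two stacked Series on their (row, position) index and leaves a gap where a row has fewer students than subjects. The extra dropna() calls and the name `aligned` are not part of the answer; dropna() is an assumption added so that the sketch behaves the same on pandas versions whose stack() keeps missing values.

```python
import pandas as pd

# Example frame copied from the question.
df = pd.DataFrame({
    "SUBJECT": ["Math/Chemistry/Biology", "Geology/Physics"],
    "STUDENT": ["Sam/Peter/Mary", "John"],
    "CITY": ["Los Angeles", "Boston"],
    "STATE": ["CA", "MA"],
})

# dropna() is a defensive addition (not in the answer); it is a no-op
# wherever stack() already discards the padding produced by expand=True.
s1 = df.SUBJECT.str.split('/', expand=True).stack().dropna()
s2 = df.STUDENT.str.split('/', expand=True).stack().dropna()

# concat aligns on the (row, position) MultiIndex, so row (1, 1) has a
# subject ('Physics') but no student yet.
aligned = pd.concat([s1, s2], axis=1, keys=('SUBJECT', 'STUDENT'))
print(aligned)

# Forward fill copies 'John' from (1, 0) into the gap at (1, 1).
print(aligned.ffill())
```

Note that the forward fill simply repeats the last value seen, so it suits rows like John's where one value applies to every subject; a row with, say, three subjects and two students would silently reuse the second student.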

Answer 1 (score: 2)

First, split the fields on '/':

df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')

Then I use a function to explode the rows. However, I had to separate out the rows that have only one student or only one subject.

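import numpy as np
import pandas as pd

def explode(df, columns):
    # Repeat each index label once per list element in the first column.
    idx = np.repeat(df.index, df[columns[0]].str.len())
    # reindex replaces reindex_axis, which was removed in pandas 1.0.
    a = df.T.reindex(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)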


cond = df.STUDENT.str.len() == df.SUBJECT.str.len()

df_paired = df[cond]
df_unpard = df[~cond]

if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])

if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])

Finally:

pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]
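
The first line of explode is what ties each exploded value back to its source row. A minimal sketch of that step on its own, using a hypothetical two-row Series named `lists` rather than the full frame:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a column that has already been split into lists.
lists = pd.Series([["Sam", "Peter", "Mary"], ["John"]], index=[0, 1])

# .str.len() gives the length of each list (3 and 1); np.repeat then
# repeats each index label that many times.
idx = np.repeat(lists.index, lists.str.len())
print(idx.tolist())  # [0, 0, 0, 1]
```

The repeated labels become the index of the exploded values, which is how they line up with the untouched columns (CITY, STATE) when the pieces are concatenated.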


Timings

piRSquared

%%timeit

df = df_.copy()

df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')

def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    a = df.T.reindex(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)

cond = df.STUDENT.str.len() == df.SUBJECT.str.len()

df_paired = df[cond]
df_unpard = df[~cond]

if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])

if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])

pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]

100 loops, best of 3: 7.76 ms per loop

jezrael

%%timeit

df = df_.copy()

s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()

df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT')) \
        .ffill() \
        .reset_index(level=1, drop=True)

df.drop(['SUBJECT','STUDENT'], axis=1) \
  .join(df1) \
  .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]

100 loops, best of 3: 5.13 ms per loop

Answer 2 (score: 1)

Try this. It pairs the values with zip, and it can be modified to handle rows where SUBJECT has only one entry:

df3.SUBJECT = df3.SUBJECT.str.split('/')
df3.STUDENT = df3.STUDENT.str.split('/')

def splitter(gb):
    # Only the first row of each (CITY, STATE) group is used.
    ll = []
    subs, stus = gb.SUBJECT.values[0], gb.STUDENT.values[0]

    if   len(stus) == len(subs): ll = list(zip(subs, stus))
    elif len(stus) == 1:         ll = list(zip(subs, stus * len(subs)))
    return pd.DataFrame(ll, columns=["SUBJECT", "STUDENT"])

df = (df3.groupby(['CITY', 'STATE'])[['SUBJECT', 'STUDENT']]
         .apply(splitter)
         .reset_index()
         .drop('level_2', axis=1))
print(df[['SUBJECT', 'STUDENT', 'CITY', 'STATE']])


   SUBJECT STUDENT        CITY STATE
0    Geology    John      Boston    MA
1    Physics    John      Boston    MA
2       Math     Sam  LosAngeles    CA
3  Chemistry   Peter  LosAngeles    CA
4    Biology    Mary  LosAngeles    CA
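
The second branch of splitter relies on plain Python list repetition to broadcast a single student across all subjects. A small sketch of just that step, with values taken from the Boston row:

```python
# Values from the Boston row, after splitting on '/'.
subs = ["Geology", "Physics"]
stus = ["John"]

# Multiplying a one-element list repeats it, so zip can pair it with
# every subject.
pairs = list(zip(subs, stus * len(subs)))
print(pairs)  # [('Geology', 'John'), ('Physics', 'John')]
```

Rows that match neither branch (for example three subjects and two students) leave ll empty, so splitter returns an empty frame for that group.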