I have a pandas DataFrame like this:
SUBJECT                 STUDENT         CITY         STATE
Math/Chemistry/Biology  Sam/Peter/Mary  Los Angeles  CA
Geology/Physics         John            Boston       MA
It should end up like this:
SUBJECT    STUDENT  CITY         STATE
Math       Sam      Los Angeles  CA
Chemistry  Peter    Los Angeles  CA
Biology    Mary     Los Angeles  CA
Geology    John     Boston       MA
Physics    John     Boston       MA
Before asking this question, I looked at the solution on this page: pandas: How do I split text in a column into multiple rows?
Because two columns contain slash-separated strings, I could not use the solution from that link.
Answer 0 (score: 3)
s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()
print(s1)
0  0         Math
   1    Chemistry
   2      Biology
1  0      Geology
   1      Physics
dtype: object
print(s2)
0  0      Sam
   1    Peter
   2     Mary
1  0     John
dtype: object
df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT')) \
        .ffill() \
        .reset_index(level=1, drop=True)
print(df1)
     SUBJECT STUDENT
0       Math     Sam
0  Chemistry   Peter
0    Biology    Mary
1    Geology    John
1    Physics    John
df = df.drop(['SUBJECT','STUDENT'], axis=1) \
       .join(df1) \
       .reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]
print(df)
     SUBJECT STUDENT         CITY STATE
0       Math     Sam  Los Angeles    CA
1  Chemistry   Peter  Los Angeles    CA
2    Biology    Mary  Los Angeles    CA
3    Geology    John       Boston    MA
4    Physics    John       Boston    MA
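For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same split/stack/ffill/join idea. The way the sample frame is built and the helper name split_rows are my assumptions, not part of the answer. The extra dropna() and sort_index() calls are defensive additions in case stack() keeps the padding values or the combined index comes back unordered.

```python
import pandas as pd

# Sample frame rebuilt from the question (this construction is an assumption).
df = pd.DataFrame({
    "SUBJECT": ["Math/Chemistry/Biology", "Geology/Physics"],
    "STUDENT": ["Sam/Peter/Mary", "John"],
    "CITY": ["Los Angeles", "Boston"],
    "STATE": ["CA", "MA"],
})


def split_rows(frame, cols, sep="/"):
    """Hypothetical helper: one output row per separated item in `cols`."""
    # One Series per column, indexed by (original row, position in the list).
    stacked = [
        frame[c].str.split(sep, expand=True).stack().dropna() for c in cols
    ]
    parts = (
        pd.concat(stacked, axis=1, keys=cols)
        .sort_index()                      # keep positions of each row together
        .ffill()                           # shorter lists reuse their last item
        .reset_index(level=1, drop=True)   # drop the position level
    )
    # Joining on the original row label repeats CITY/STATE for every item.
    out = frame.drop(cols, axis=1).join(parts).reset_index(drop=True)
    return out[list(frame.columns)]


result = split_rows(df, ["SUBJECT", "STUDENT"])
print(result)
```

Note that ffill() is what pairs John with both Geology and Physics: a column whose list is shorter simply carries its last value forward.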
Answer 1 (score: 2)
First, split on '/':
df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')
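Then I use a function to explode the rows. However, I had to handle separately the rows that have only one student or subject.
import numpy as np
import pandas as pd
def explode(df, columns):
    # repeat each index label once per element of the first list column
    idx = np.repeat(df.index, df[columns[0]].str.len())
    # reindex_axis has been removed from pandas; reindex selects the same rows
    a = df.T.reindex(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    # join tolerates the repeated labels in p's index, which concat(axis=1) may reject
    return df.drop(columns, axis=1).join(p).reset_index(drop=True)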
cond = df.STUDENT.str.len() == df.SUBJECT.str.len()
df_paired = df[cond]
df_unpard = df[~cond]
if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])
if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])
Finally:
pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]
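To see why the rows have to be separated, here is a small sketch of the length check on its own. The sample frame is rebuilt from the question, and that construction is my assumption.

```python
import pandas as pd

# Sample frame rebuilt from the question (this construction is an assumption).
df = pd.DataFrame({
    "SUBJECT": ["Math/Chemistry/Biology", "Geology/Physics"],
    "STUDENT": ["Sam/Peter/Mary", "John"],
    "CITY": ["Los Angeles", "Boston"],
    "STATE": ["CA", "MA"],
})
df["SUBJECT"] = df["SUBJECT"].str.split("/")
df["STUDENT"] = df["STUDENT"].str.split("/")

# .str.len() on a column of lists returns the length of each list.
cond = df["STUDENT"].str.len() == df["SUBJECT"].str.len()
print(cond.tolist())  # [True, False]

paired_cities = df.loc[cond, "CITY"].tolist()     # lists line up item by item
unpaired_cities = df.loc[~cond, "CITY"].tolist()  # one student, several subjects
```

The Los Angeles row has three subjects and three students, so both columns can be exploded together. The Boston row has two subjects and one student, so it is exploded one column at a time, which pairs John with each subject.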
piRSquared
%%timeit
df = df_.copy()
df.SUBJECT = df.SUBJECT.str.split('/')
df.STUDENT = df.STUDENT.str.split('/')
def explode(df, columns):
    idx = np.repeat(df.index, df[columns[0]].str.len())
    # reindex_axis as in the timed run; newer pandas uses reindex
    a = df.T.reindex_axis(columns).values
    concat = np.concatenate([np.concatenate(a[i]) for i in range(a.shape[0])])
    p = pd.DataFrame(concat.reshape(a.shape[0], -1).T, idx, columns)
    return pd.concat([df.drop(columns, axis=1), p], axis=1).reset_index(drop=True)
cond = df.STUDENT.str.len() == df.SUBJECT.str.len()
df_paired = df[cond]
df_unpard = df[~cond]
if not df_paired.empty:
    df_paired = explode(df_paired, ['STUDENT','SUBJECT'])
if not df_unpard.empty:
    df_unpard = explode(explode(df_unpard, ['STUDENT']), ['SUBJECT'])
pd.concat([df_paired, df_unpard], ignore_index=True)[df.columns]
100 loops, best of 3: 7.76 ms per loop
jezrael
%%timeit
df = df_.copy()
s1 = df.SUBJECT.str.split('/', expand=True).stack()
s2 = df.STUDENT.str.split('/', expand=True).stack()
df1 = pd.concat([s1,s2], axis=1, keys=('SUBJECT','STUDENT')) \
.ffill() \
.reset_index(level=1, drop=True)
df.drop(['SUBJECT','STUDENT'], axis=1) \
.join(df1) \
.reset_index(drop=True)[['SUBJECT', 'STUDENT', 'CITY','STATE']]
100 loops, best of 3: 5.13 ms per loop
Answer 2 (score: 1)
Try this. It can be modified to handle the case where the length of SUBJECT equals 1; the pairing itself is done with zip.
df3.SUBJECT = df3.SUBJECT.str.split('/')
df3.STUDENT = df3.STUDENT.str.split('/')
def splitter(gb):
    ll = []
    subs, stus = gb.SUBJECT.values[0], gb.STUDENT.values[0]
    if len(stus) == len(subs):
        ll = list(zip(subs, stus))
    elif len(stus) == 1:
        # one student for several subjects: repeat the student for each subject
        ll = list(zip(subs, stus * len(subs)))
    return pd.DataFrame(ll, columns=["SUBJECT", "STUDENT"])
df = df3.groupby(['CITY','STATE'])[['SUBJECT','STUDENT']].apply(splitter).reset_index().drop('level_2', axis=1)
print(df[['SUBJECT', 'STUDENT', 'CITY', 'STATE']])
     SUBJECT STUDENT        CITY STATE
0    Geology    John      Boston    MA
1    Physics    John      Boston    MA
2       Math     Sam  LosAngeles    CA
3  Chemistry   Peter  LosAngeles    CA
4    Biology    Mary  LosAngeles    CA
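The zip pairing can also be looked at in isolation, without groupby. The sketch below is only an illustration: pair_up is a hypothetical helper, the input rows are retyped from the question, and the branch for a single SUBJECT is my guess at the modification mentioned above.

```python
import pandas as pd


def pair_up(subjects, students):
    """Hypothetical helper: pair two lists, repeating whichever has one item."""
    if len(students) == 1:
        students = students * len(subjects)
    elif len(subjects) == 1:
        subjects = subjects * len(students)
    if len(subjects) != len(students):
        raise ValueError("cannot pair lists of different lengths")
    return list(zip(subjects, students))


# Rows retyped from the question (this construction is an assumption).
rows = [
    ("Math/Chemistry/Biology", "Sam/Peter/Mary", "Los Angeles", "CA"),
    ("Geology/Physics", "John", "Boston", "MA"),
]

# One output record per (subject, student) pair; city and state are repeated.
records = [
    (subject, student, city, state)
    for subs, stus, city, state in rows
    for subject, student in pair_up(subs.split("/"), stus.split("/"))
]
out = pd.DataFrame(records, columns=["SUBJECT", "STUDENT", "CITY", "STATE"])
print(out)
```

Unlike the groupby version, this keeps the rows in their original order, since nothing is sorted by CITY and STATE.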