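My dataset is in a DataFrame (df1):
I want to split the event_type field of df1 above into separate columns, one per letter, up to a maximum of 20 columns, with each column holding a single letter.
What I tried in pandas, which works: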
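import pandas as pd
df2 = df1.toPandas()  # convert the Spark DataFrame to pandas
# Split each string into letters, pad or trim to 20 positions, and name the columns step_0 ... step_19
df3 = df2.head().event_type.apply(lambda x: pd.Series(list(x)).reindex(range(20)).add_prefix('step_'))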
Problem:
The final dataset should look like this:
Attempt:
df2 = df1.event_type.str.split('', expand=True)
df2.head()
Final code that works:
df2 = pd.concat([pd.DataFrame(df1.event_type.map(list).tolist()).add_prefix('step'), df1.event_type], axis=1)
Merge it back with the original DataFrame:
df3 = pd.merge(df1, df2, how='inner', on=['event_type'])
Final output: df3.head()
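One caveat with merging on event_type: if the same event_type value appears in more than one row, an inner merge pairs every matching row with every other, so the row count can grow. A minimal sketch, assuming a small made-up pandas frame (the values are illustrative, not the original data), that compares the merge with attaching the letter columns by row index:

```python
import pandas as pd

# Hypothetical sample data; note the duplicated event_type value "abc".
df1 = pd.DataFrame({"id": [1, 2, 3], "event_type": ["abc", "de", "abc"]})

# One column per letter; shorter strings are padded with missing values.
letters = pd.DataFrame(
    df1["event_type"].map(list).tolist(), index=df1.index
).add_prefix("step")

# Join on the row index, so each row receives exactly its own letters.
by_index = df1.join(letters)

# For comparison: the merge on event_type pairs duplicates with each other.
df2 = pd.concat([letters, df1["event_type"]], axis=1)
by_key = pd.merge(df1, df2, how="inner", on=["event_type"])

print(len(by_index), len(by_key))
```

If event_type is known to be unique, the two approaches give the same rows; the index join simply does not depend on that.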
Answer 0 (score: 2)
Use Spark's column split function.
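from pyspark.sql.functions import length, rpad, split
df = df.withColumn('len', length('event_type'))
# Pad to 20 characters; the pattern '|' matches the empty string, so the split happens between every character
df2 = df.withColumn('temp', rpad('event_type', 20, ' ')) \
    .select(*df.columns, *map(lambda i: split('temp', '|')[i].alias('step_' + str(i+1)), range(0, 20))) \
    .drop('temp')
df2.show(10, False)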
+---+-----------------+---+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|id |event_type |len|step_1|step_2|step_3|step_4|step_5|step_6|step_7|step_8|step_9|step_10|step_11|step_12|step_13|step_14|step_15|step_16|step_17|step_18|step_19|step_20|
+---+-----------------+---+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|1 |asdferwafaryt |13 |a |s |d |f |e |r |w |a |f |a |r |y |t | | | | | | | |
|2 |aweraadsf |9 |a |w |e |r |a |a |d |s |f | | | | | | | | | | | |
|3 |adsfwelklkjasdklj|17 |a |d |s |f |w |e |l |k |l |k |j |a |s |d |k |l |j | | | |
|4 |adlflkjwelkj |12 |a |d |l |f |l |k |j |w |e |l |k |j | | | | | | | | |
|5 |lafelkjcfpoiucz |15 |l |a |f |e |l |k |j |c |f |p |o |i |u |c |z | | | | | |
+---+-----------------+---+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
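For readers working in pandas rather than Spark, the same pad-then-split idea can be sketched as follows. This is an illustration under assumptions, not part of the answer: the two rows are copied from the table above, and `width` is a name introduced here for the 20-column limit.

```python
import pandas as pd

# Two rows taken from the output table above.
df = pd.DataFrame({"id": [1, 2], "event_type": ["asdferwafaryt", "aweraadsf"]})
width = 20  # assumed maximum number of step columns

# Pad on the right to a fixed width, and cut anything longer than that.
padded = df["event_type"].str.ljust(width).str.slice(0, width)

# Every padded string now has exactly `width` characters, one per column.
steps = pd.DataFrame(
    padded.map(list).tolist(),
    index=df.index,
    columns=["step_" + str(i + 1) for i in range(width)],
)

out = pd.concat([df.assign(len=df["event_type"].str.len()), steps], axis=1)
print(out.iloc[:, :6])
```

As in the Spark output, unused positions hold a single space rather than a missing value.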
Answer 1 (score: 1)
Try map, then construct a DataFrame from the result and name the columns with add_prefix.
Sample df:
Out[26]:
id event_type
0 0 cfaac
1 1 fdceb
2 2 cdceeedaffcbaefeefa
3 3 cbccbe
4 4 ecfcffddbad
5 5 eebd
6 6 eecffbcbaffcbfecceffe
7 7 abdaadacdeeaaaceb
8 8 ceadfacedfbbbaceb
9 9 bbbcfaacbffdf
pd.DataFrame(df.event_type.map(list).tolist()).add_prefix('step')
Out[25]:
step0 step1 step2 step3 step4 step5 step6 step7 step8 step9 step10 step11 \
0 c f a a c None None None None None None None
1 f d c e b None None None None None None None
2 c d c e e e d a f f c b
3 c b c c b e None None None None None None
4 e c f c f f d d b a d None
5 e e b d None None None None None None None None
6 e e c f f b c b a f f c
7 a b d a a d a c d e e a
8 c e a d f a c e d f b b
9 b b b c f a a c b f f d
step12 step13 step14 step15 step16 step17 step18 step19 step20
0 None None None None None None None None None
1 None None None None None None None None None
2 a e f e e f a None None
3 None None None None None None None None None
4 None None None None None None None None None
5 None None None None None None None None None
6 b f e c c e f f e
7 a a c e b None None None None
8 b a c e b None None None None
9 f None None None None None None None None
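Note that row 6 in the sample has 21 characters, so the output above gains a 21st column (step20). If the 20-column limit from the question matters, one possible variation is to truncate before splitting and to fix the column set explicitly. A sketch, assuming two rows copied from the sample and a `max_steps` name introduced here:

```python
import pandas as pd

# Rows 5 and 6 from the sample above (4 and 21 characters long).
df = pd.DataFrame({"id": [5, 6], "event_type": ["eebd", "eecffbcbaffcbfecceffe"]})
max_steps = 20  # assumed cap on the number of letter columns

wide = (
    pd.DataFrame(
        # Keep at most max_steps characters, then split into letters.
        df["event_type"].str[:max_steps].map(list).tolist(),
        index=df.index,
    )
    # Guarantee exactly max_steps columns even if every string is shorter.
    .reindex(columns=range(max_steps))
    .add_prefix("step")
)
print(wide.shape)
```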
Answer 2 (score: 0)
Try df2 = df1.event_type.str.split('', expand=True)
Then merge the result with the other ("id", "len") columns.
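How an empty separator is handled is worth checking before relying on it: plain Python's str.split rejects it outright, and the pandas result may include columns other than the letters themselves. A hedged sketch that avoids the empty separator by splitting with list() and then combines the letters with the id and len columns; the frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical sample data with the columns named in the answer.
df1 = pd.DataFrame({"id": [1, 2], "event_type": ["abc", "de"]})
df1["len"] = df1["event_type"].str.len()

# Plain Python refuses an empty separator, so list() is the usual way
# to break a string into single characters.
try:
    "abc".split("")
    empty_sep_ok = True
except ValueError:
    empty_sep_ok = False

# One column per letter; shorter strings get missing values at the end.
letters = df1["event_type"].apply(lambda s: pd.Series(list(s))).add_prefix("step_")

# Combine by row index with the other columns.
result = pd.concat([df1[["id", "len"]], letters], axis=1)
print(result)
```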