Split a column of text data with no delimiter into separate columns

Time: 2020-09-25 21:02:15

Tags: python pandas pyspark

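My dataset is in a DataFrame (df1):

[image not available]

I want to split the event_type field of df1 above into separate columns, one per letter, up to a maximum of 20 columns, with each column holding exactly one letter.

What I tried in pandas, which works: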

import numpy as np  # needed for np.arange below
import pandas as pd
df2 = df1.toPandas()  # convert the Spark DataFrame to pandas
df3 = df2.head().event_type.apply(lambda x: pd.Series(list(x), index=[['step_'+str(col) for col in np.arange(0, 20, 1)]]))

[image not available]

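Problems:

  • When I remove head() to apply the expression to the whole event_type column, I get a length error (ValueError: Length of passed values is 11, index implies 20).
  • I also want all three original columns of the main DataFrame df1 to appear in the final DataFrame df3, alongside the split values.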
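
The ValueError arises because pd.Series(list(x), index=...) needs exactly as many values as index labels, and an event_type with 11 characters supplies only 11 values for 20 labels. A minimal sketch of one way around it, assuming a small made-up frame and a hypothetical helper named split_letters (neither is from the original post), pads each letter list to a fixed width before building the Series:

```python
import pandas as pd

# Made-up sample data; the real df1 comes from the question.
df = pd.DataFrame(
    {"id": [1, 2], "event_type": ["asdferwafar", "aweraadsf"], "len": [11, 9]}
)

WIDTH = 20  # maximum number of step columns, as stated in the question


def split_letters(value, width=WIDTH):
    """Hypothetical helper: one letter per column, padded with None up to `width`."""
    letters = list(value)[:width]
    letters += [None] * (width - len(letters))
    return pd.Series(letters, index=[f"step_{i}" for i in range(width)])


# Series.apply with a function that returns a Series yields a DataFrame.
steps = df["event_type"].apply(split_letters)

# Concatenating along columns keeps id, event_type and len next to the step columns.
df3 = pd.concat([df, steps], axis=1)
```

Because every row now produces exactly WIDTH values, the same expression should work on the whole column rather than only on head().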

The final dataset should look like this:

[image not available]

Attempt:

df2 = df1.event_type.str.split('', expand=True)
df2.head()

Final code that worked:

df2 = pd.concat([pd.DataFrame(df1.event_type.map(list).tolist()).add_prefix('step'), df1.event_type], axis=1)

Merge it back with the original DataFrame:

df3 = pd.merge(df1, df2, how='inner', on=['event_type'])

Final output: df3.head()

[image not available]
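
One caveat about merging on event_type: if the same event_type value appears in more than one row, an inner merge pairs every matching row with every other, so the result can have more rows than df1. A hedged sketch of the difference, using made-up data and joining on the row index instead (this assumes df1 is already a pandas DataFrame):

```python
import pandas as pd

# Made-up data with a repeated event_type value.
df1 = pd.DataFrame(
    {"id": [1, 2, 3], "event_type": ["ab", "ab", "cde"], "len": [2, 2, 3]}
)

# One column per letter; shorter strings are padded with None.
steps = pd.DataFrame(
    df1["event_type"].map(list).tolist(), index=df1.index
).add_prefix("step")

# Merging on event_type pairs each "ab" row with both "ab" rows on the other side.
merged = pd.merge(
    df1,
    pd.concat([steps, df1["event_type"]], axis=1),
    how="inner",
    on=["event_type"],
)

# Joining on the row index keeps exactly one output row per input row.
joined = df1.join(steps)
```

Here merged has five rows while joined keeps the original three, which is usually what is wanted when the split columns were derived row by row.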

3 answers:

Answer 0: (score: 2)

Use Spark's column split function.

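from pyspark.sql.functions import length, rpad, split

df  = df.withColumn('len', length('event_type'))
# Pad to 20 characters; the pattern '|' matches the empty string, so split yields single characters.
df2 = df.withColumn('temp', rpad('event_type', 20, ' ')) \
  .select(*df.columns, *map(lambda i: split('temp', '|')[i].alias('step_' + str(i+1)), range(0, 20))) \
  .drop('temp')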

df2.show(10, False)

+---+-----------------+---+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|id |event_type       |len|step_1|step_2|step_3|step_4|step_5|step_6|step_7|step_8|step_9|step_10|step_11|step_12|step_13|step_14|step_15|step_16|step_17|step_18|step_19|step_20|
+---+-----------------+---+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|1  |asdferwafaryt    |13 |a     |s     |d     |f     |e     |r     |w     |a     |f     |a      |r      |y      |t      |       |       |       |       |       |       |       |
|2  |aweraadsf        |9  |a     |w     |e     |r     |a     |a     |d     |s     |f     |       |       |       |       |       |       |       |       |       |       |       |
|3  |adsfwelklkjasdklj|17 |a     |d     |s     |f     |w     |e     |l     |k     |l     |k      |j      |a      |s      |d      |k      |l      |j      |       |       |       |
|4  |adlflkjwelkj     |12 |a     |d     |l     |f     |l     |k     |j     |w     |e     |l      |k      |j      |       |       |       |       |       |       |       |       |
|5  |lafelkjcfpoiucz  |15 |l     |a     |f     |e     |l     |k     |j     |c     |f     |p      |o      |i      |u      |c      |z      |       |       |       |       |       |
+---+-----------------+---+------+------+------+------+------+------+------+------+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Answer 1: (score: 1)

Try map, build a DataFrame from the resulting lists, and name the columns with add_prefix.

Sample df

Out[26]:
   id             event_type
0   0                  cfaac
1   1                  fdceb
2   2    cdceeedaffcbaefeefa
3   3                 cbccbe
4   4            ecfcffddbad
5   5                   eebd
6   6  eecffbcbaffcbfecceffe
7   7      abdaadacdeeaaaceb
8   8      ceadfacedfbbbaceb
9   9          bbbcfaacbffdf

pd.DataFrame(df.event_type.map(list).tolist()).add_prefix('step')

Out[25]:
  step0 step1 step2 step3 step4 step5 step6 step7 step8 step9 step10 step11  \
0     c     f     a     a     c  None  None  None  None  None   None   None
1     f     d     c     e     b  None  None  None  None  None   None   None
2     c     d     c     e     e     e     d     a     f     f      c      b
3     c     b     c     c     b     e  None  None  None  None   None   None
4     e     c     f     c     f     f     d     d     b     a      d   None
5     e     e     b     d  None  None  None  None  None  None   None   None
6     e     e     c     f     f     b     c     b     a     f      f      c
7     a     b     d     a     a     d     a     c     d     e      e      a
8     c     e     a     d     f     a     c     e     d     f      b      b
9     b     b     b     c     f     a     a     c     b     f      f      d

  step12 step13 step14 step15 step16 step17 step18 step19 step20
0   None   None   None   None   None   None   None   None   None
1   None   None   None   None   None   None   None   None   None
2      a      e      f      e      e      f      a   None   None
3   None   None   None   None   None   None   None   None   None
4   None   None   None   None   None   None   None   None   None
5   None   None   None   None   None   None   None   None   None
6      b      f      e      c      c      e      f      f      e
7      a      a      c      e      b   None   None   None   None
8      b      a      c      e      b   None   None   None   None
9      f   None   None   None   None   None   None   None   None
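
In the output above the longest string has 21 characters, so the frame ends up with 21 columns (step0 to step20). If the cap of 20 columns from the question matters, one possible variation is to reindex the columns to a fixed range before naming them. This is only a sketch: the two-row frame is a cut-down copy of the sample, and the step_1 to step_20 naming is an assumption borrowed from the Spark answer.

```python
import pandas as pd

# Two rows taken from the sample df above.
df = pd.DataFrame({"id": [0, 6], "event_type": ["cfaac", "eecffbcbaffcbfecceffe"]})

MAX_STEPS = 20  # cap from the question

steps = (
    pd.DataFrame(df["event_type"].map(list).tolist(), index=df.index)
    .reindex(columns=range(MAX_STEPS))           # drop extra columns, add missing ones as NaN
    .rename(columns=lambda i: f"step_{i + 1}")   # step_1 ... step_20
)
```

The 21st character of the long string is dropped, and strings shorter than 20 characters are filled with missing values on the right.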

Answer 2: (score: 0)

Try df2 = df1.event_type.str.split('', expand=True)

Then merge the result with the other columns ("id", "len").