Question

I have a code such as C78907. I want to split it into:

C78     # Level 1
C789    # Level 2
C7890   # Level 3
C78907  # Level 4

So far I have been using:

# concat() around a single column is a no-op, so substr() is used directly
Df3 = Df2.withColumn('Level_One', Df2.code.substr(1, 3))
Df4 = Df3.withColumn('Level_two', Df3.code.substr(1, 4))
Df5 = Df4.withColumn('Level_three', Df4.code.substr(1, 5))
Df6 = Df5.withColumn('Level_four', Df5.code.substr(1, 6))

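The problem is that, when I look at the result, the level-four code (which should be 6 characters long) may just contain the level-one, level-two, or level-three code. When the string is shorter than the requested length, substr returns whatever characters are available: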

721 7213    7213    7213
758 7580    7580    7580
724 7242    7242    7242
737 7373    73730   73730
789 7895    78959   78959
V06 V061    V061    V061
381 3810    38100   38100

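Ideally, a restriction would help. What I mean is:

For level one, keep exactly 3 characters.
For level two, 4 characters and no fewer.
For level three, 5 characters and no fewer.
For level four, 6 characters and no fewer.
If the required number of characters is not there, put null instead of reusing the previous level's value.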

Desired output:

Initial_code   level1  level2   level3   level4        
 7213           721    7213     null      null
 7580           758    7580     null      null
 7242           724    7242     null      null
 73730          737    7373     73730     null
 38100D         381    3810     38100     38100D
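
The rule above can be written down for a single value in plain Python, which may be handy as a reference when checking the Spark result. This is only a sketch: the helper names prefix_or_none and levels are made up for illustration and are not part of PySpark.

```python
from typing import List, Optional


def prefix_or_none(code: str, n: int) -> Optional[str]:
    """Return the first n characters of code, or None if code is shorter than n."""
    return code[:n] if len(code) >= n else None


def levels(code: str) -> List[Optional[str]]:
    # Levels 1 to 4 require 3, 4, 5 and 6 characters respectively.
    return [prefix_or_none(code, n) for n in (3, 4, 5, 6)]


for code in ["7213", "73730", "38100D"]:
    print(code, levels(code))
```

Each row of the desired output corresponds to one call of levels() on the initial code.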

Answer 1

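You can use pyspark.sql.functions.when() and pyspark.sql.functions.length() to get the desired output. When creating each column, check whether the substring has the correct length. If it does not, set the column to None using pyspark.sql.functions.lit().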

For example:

import pyspark.sql.functions as f
df.withColumn('Level_One', f.when(
        f.length(f.col('code').substr(1, 3)) == 3,
        f.col('code').substr(1, 3)
    ).otherwise(f.lit(None)))\
    .withColumn('Level_Two', f.when(
        f.length(f.col('code').substr(1, 4)) == 4,
        f.col('code').substr(1, 4)
    ).otherwise(f.lit(None)))\
    .withColumn('Level_Three', f.when(
        f.length(f.col('code').substr(1, 5)) == 5,
        f.col('code').substr(1, 5)
    ).otherwise(f.lit(None)))\
    .withColumn('Level_Four', f.when(
        f.length(f.col('code').substr(1, 6)) == 6,
        f.col('code').substr(1, 6)
    ).otherwise(f.lit(None)))\
    .show()

Output:

+------+---------+---------+-----------+----------+
|  Code|Level_One|Level_Two|Level_Three|Level_Four|
+------+---------+---------+-----------+----------+
|  7213|      721|     7213|       null|      null|
|  7580|      758|     7580|       null|      null|
|  7242|      724|     7242|       null|      null|
| 73730|      737|     7373|      73730|      null|
|38100D|      381|     3810|      38100|    38100D|
+------+---------+---------+-----------+----------+

Substring with a limit (pyspark.sql.Column.substr)
