Replace string column values with null when the field is empty or len(field.strip(' \t')) == 0

Date: 2017-05-22 16:11:19

Tags: apache-spark dataframe pyspark apache-zeppelin

%spark.pyspark
l = [('user1', 33, 1.0, 'chess'), ('user2', 34, 2.0, 'tenis'), ('user3', None, None, ''), ('user4', None, 4.0, '   '), ('user5', None, 5.0, 'ski')]
df = spark.createDataFrame(l, ['name', 'age', 'ratio', 'hobby'])
df.printSchema()
df.show()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- ratio: double (nullable = true)
 |-- hobby: string (nullable = true)
+-----+----+-----+-----+
| name| age|ratio|hobby|
+-----+----+-----+-----+
|user1|  33|  1.0|chess|
|user2|  34|  2.0|tenis|
|user3|null| null|     |
|user4|null|  4.0|     |
|user5|null|  5.0|  ski|
+-----+----+-----+-----+

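I want to replace the value of a string column with null when the field is empty or len(field.strip(' \t')) == 0. In my case, the blank entries in the 'hobby' column should become null. Any hints?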

1 Answer:

Answer 0: (score: 0)

You can replace blank values with null like this:

df.withColumn("hobby", blank_as_null("hobby"))

To check for len(field.strip(' \t')) == 0, you can use a helper function built on column expressions (a UDF would also work):

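from pyspark.sql.functions import col, length, lit, regexp_replace, when

def replace(column, value):
    # Remove spaces and tabs, then swap in `value` when nothing is left.
    stripped = regexp_replace(column, r"[ \t]", "")
    return when(length(stripped) == 0, value).otherwise(column)

df.withColumn("hobby", replace(col("hobby"), lit(None))).show()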