Question

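Hello, I have a project that uses HDFS, Hive, and Spark. When I import data, a missing value in a numeric field is replaced with null, but a missing value in a string field is replaced with the empty string "". To work around this, I added the following line when creating the table in Hive.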

TBLPROPERTIES('serialization.null.format'='');

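But when I load the table into a Spark DataFrame, the empty strings still show up as "" rather than null.

What could be the reason? Are some Hive table properties not supported by Spark?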

Answer 1

@Manu,

Here is a summary of how to do this conversion on a Spark DataFrame:

from pyspark.sql import Row

# Create a sample DataFrame (col2 stays numeric so schema inference succeeds)
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2=None)])

The following function converts all empty strings in a column to null, as you requested:

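from pyspark.sql.functions import col, when

def blank_as_null(z):
    # Keep non-empty values; empty strings (and existing nulls) become null
    return when(col(z) != "", col(z)).otherwise(None)

dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))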
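
The snippet above fixes a single column. If the table has many string columns, the same idea could be applied to all of them in one pass. The following is only a sketch under a few assumptions: the helper names `string_columns` and `blank_strings_to_null` are made up for illustration, column names are assumed not to contain dots or other characters that need quoting, and only columns whose Spark SQL type is reported as `string` in `df.dtypes` are touched.

```python
def string_columns(dtypes):
    """Return the names of string-typed columns.

    `dtypes` is expected to look like DataFrame.dtypes, i.e. a list of
    (column_name, type_name) tuples such as [("col1", "string")].
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def blank_strings_to_null(df):
    """Return a copy of `df` where "" in every string column becomes null.

    Hypothetical helper, not part of the PySpark API.
    """
    # Imported here so string_columns() can be used without a Spark install.
    from pyspark.sql.functions import col, when

    targets = set(string_columns(df.dtypes))
    # A single select keeps the original column order and avoids
    # chaining one withColumn call per column.
    return df.select([
        when(col(name) != "", col(name)).otherwise(None).alias(name)
        if name in targets else col(name)
        for name in df.columns
    ])
```

With the sample data above, this would be called as `cleanedDF = blank_strings_to_null(testDF)`. Non-string columns pass through unchanged, so numeric fields that are already null stay null.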
