Question

I am working with a Hive table on Hadoop and processing the data with PySpark. I read the dataset:

df = sqlContext.sql('select * from  db.table1')
df.select("var1").printSchema()
|-- var1: string (nullable = true)

There are some blank values in the dataset that Spark does not seem to recognize! I can easily find the null values with

df.where(F.isnull(F.col("var1"))).count()
10163101

But when I use

df.where(F.col("var1")=='').count()

it returns zero, yet when I check in SQL I have 6908 blank values.

Here are the SQL queries and their results:

SELECT count(*)
FROM [Y].[dbo].[table1]
where var1=''

6908

and

SELECT count(*)
FROM [Y].[dbo].[table1]
where var1 is null

10163101

The SQL and PySpark tables have the same row count:

df.count()
10171109

and

SELECT count(*)
FROM [Y].[dbo].[table1]
10171109

When I try to find the blanks using length or size, I get an error:

df.where(F.size(F.col("var1")) == 0).count()

AnalysisException: "cannot resolve 'size(var1)' due to data type 
mismatch: argument 1 requires (array or map) type, however, 'var1' 
is of string type.;"

How can I solve this? My Spark version is 1.6.3.

Thanks

Answer 1

I tried regexp and finally found the blanks!

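A minimal sketch of how a regexp can find such blanks, assuming they are whitespace-only strings (an assumption; the question does not show the raw values). That would also fit the counts in the question, since SQL Server ignores trailing spaces in `=` comparisons while Spark compares strings exactly. The sample list is hypothetical, and plain Python stands in for the DataFrame so the logic can be run locally; the comments name the PySpark calls that correspond to each step.

```python
import re

# Hypothetical sample standing in for the var1 column; None plays the role of NULL.
values = [None, "", "   ", "\t", "abc", " x ", None]

# Corresponds to F.isnull(F.col("var1")): counts NULLs only.
null_count = sum(1 for v in values if v is None)

# Corresponds to F.col("var1") == '': an exact match, so "   " is not counted.
exact_empty_count = sum(1 for v in values if v == "")

# Regexp for "empty or whitespace only".
# Corresponds to F.col("var1").rlike(r"^\s*$") in PySpark.
BLANK = re.compile(r"^\s*$")
blank_count = sum(1 for v in values if v is not None and BLANK.match(v))

print(null_count, exact_empty_count, blank_count)
```

On the real table the same pattern would be applied as `df.where(F.col("var1").rlike(r"^\s*$")).count()`. `F.size` accepts only array or map columns, which is why it fails on a string column; `F.length(F.trim(F.col("var1"))) == 0` is a string-based alternative that would catch space-only values.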
