I'm seeing strange behavior on Spark SQL 1.5.2. I have the following data:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.joda.time.DateTime // getHourOfDay is the Joda-Time API

// `context` is an existing SQLContext
val data: RDD[Row] = context.sparkContext.parallelize(Seq(
  Seq("show", "1234", 1465989124000L),
  Seq("show", "1235", 1465989124001L),
  Seq("show", "1236", 1465989124003L),
  Seq("show", "1237", 1465985524000L),
  Seq("end", "1238", 1465985524001L),
  Seq("show", "1239", 1465985524002L),
  Seq("show", "1240", 1465985524003L),
  Seq("show", "1241", 1465953124000L)
)).map(r => r :+ new DateTime(r(2)).getHourOfDay).map(Row.fromSeq) // append the hour of day in the default time zone
val dataFrame: DataFrame = context.createDataFrame(data, StructType(Seq(
  StructField("eventType", DataTypes.StringType, nullable = false),
  StructField("eventId", DataTypes.StringType, nullable = false),
  StructField("eventTime", DataTypes.LongType, nullable = false),
  StructField("eventHour", DataTypes.IntegerType, nullable = false)
)))
When I run the following query:
import org.apache.spark.sql.functions.{count, countDistinct, expr}

val dataFrame1: DataFrame = dataFrame
  .cube("eventHour", "eventType")
  .agg(count("eventType"), countDistinct("eventType"))
  .where(expr("eventType is not null"))
I get:
+---------+---------+----------------+-------------------------+
|eventHour|eventType|count(eventType)|COUNT(DISTINCT eventType)|
+---------+---------+----------------+-------------------------+
|        4|     null|               1|                        0|
|        4|     show|               1|                        1|
|       14|     null|               3|                        0|
|       13|      end|               1|                        1|
|     null|      end|               1|                        1|
|       13|     null|               4|                        0|
|       14|     show|               3|                        1|
|     null|     null|               8|                        0|
|       13|     show|               3|                        1|
|     null|     show|               7|                        1|
+---------+---------+----------------+-------------------------+
I expected to get the following data:
+---------+---------+----------------+-------------------------+
|eventHour|eventType|count(eventType)|COUNT(DISTINCT eventType)|
+---------+---------+----------------+-------------------------+
|        4|     show|               1|                        1|
|       13|      end|               1|                        1|
|     null|      end|               1|                        1|
|       14|     show|               3|                        1|
|       13|     show|               3|                        1|
|     null|     show|               7|                        1|
+---------+---------+----------------+-------------------------+
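In other words, every row where eventType is null should be filtered out, but that does not happen. I'm asking only about the null check because in the real code the query is built dynamically and the column order is unknown (so I can't use rollup or grouping instead).
I tried using isNull and got the same results.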
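To make the expectation concrete, here is a minimal plain-Scala sketch (no Spark involved) of what cube followed by a working null filter would return. It is only an illustration: the names Event, CubeSketch, cubeCounts and expected are made up, the hour values are copied from the output above rather than recomputed from the timestamps, and None stands in for the null that marks a rolled-up column. Because eventType is never null in the input, the group size here corresponds to count(eventType).

```scala
// Hypothetical stand-in for one input row; the hours are taken from the output above.
final case class Event(eventHour: Int, eventType: String)

object CubeSketch {
  val events: Seq[Event] = Seq(
    Event(14, "show"), Event(14, "show"), Event(14, "show"),
    Event(13, "show"), Event(13, "end"), Event(13, "show"), Event(13, "show"),
    Event(4, "show")
  )

  // cube(eventHour, eventType): every row contributes to all four grouping sets.
  // None plays the role of the null that marks a rolled-up column.
  def cubeCounts(rows: Seq[Event]): Map[(Option[Int], Option[String]), Int] =
    rows.flatMap { e =>
      for {
        h <- Seq(Some(e.eventHour), None)
        t <- Seq(Some(e.eventType), None)
      } yield (h, t)
    }.groupBy(identity).map { case (key, group) => key -> group.size }

  // The filter the question expects: keep only rows whose eventType is not null.
  def expected(rows: Seq[Event]): Map[(Option[Int], Option[String]), Int] =
    cubeCounts(rows).filter { case ((_, eventType), _) => eventType.isDefined }
}
```

CubeSketch.cubeCounts(CubeSketch.events) has ten entries, one per row of the output I actually get, and CubeSketch.expected(CubeSketch.events) keeps the six entries shown in the expected table.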
For now I check the column type and decide which check to apply:
dataFrame.schema.find(_.name == this.name).map(_.dataType match {
  // each predicate evaluates to null, and so drops the row, when the column is null
  case DataTypes.StringType => org.apache.spark.sql.functions.expr(s"length($name) > 0")
  case DataTypes.LongType | DataTypes.ShortType | DataTypes.IntegerType | DataTypes.ByteType => expr(s"$name > 0 or $name <= 0")
  case DataTypes.DoubleType | DataTypes.FloatType => expr(s"$name > 0.0 or $name <= 0.0")
  case DataTypes.BooleanType => expr(s"$name = true or $name = false")
  case DataTypes.DateType => expr(s"$name >= ${new Date(0)}")
})
Ugly, but it works, except that empty strings get filtered out as well.
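One possible tweak for the empty-string problem, sketched below without Spark so it can be run on its own: compare the string length with >= 0 instead of > 0. In SQL, length of a null is null, so the comparison is still null for null values, while an empty string has length 0 and passes. The ColKind hierarchy and NotNullPredicate object are hypothetical names introduced only for this sketch; the returned strings are meant to be passed to expr(...) exactly as in the snippet above. I have not verified that Spark 1.5.2 evaluates this variant any differently from the other predicates, so treat it as an assumption.

```scala
// Hypothetical, Spark-free model of the column types handled in the snippet above.
sealed trait ColKind
case object StringKind extends ColKind
case object IntegralKind extends ColKind   // Long, Short, Integer, Byte
case object FractionalKind extends ColKind // Double, Float
case object BooleanKind extends ColKind

object NotNullPredicate {
  // Builds a SQL predicate string intended to hold for non-null values
  // and to evaluate to null (so the row is dropped) for null values.
  def forColumn(name: String, kind: ColKind): String = kind match {
    case StringKind     => s"length($name) >= 0" // length('') is 0, so empty strings are kept
    case IntegralKind   => s"$name > 0 or $name <= 0"
    case FractionalKind => s"$name > 0.0 or $name <= 0.0"
    case BooleanKind    => s"$name = true or $name = false"
  }
}
```

For example, NotNullPredicate.forColumn("eventType", StringKind) yields the string length(eventType) >= 0, which would replace the length(eventType) > 0 check used for string columns.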