pyspark dataframe - Why are null values identified differently in the following scenarios?

Asked: 2017-06-15 21:24:05

Tags: apache-spark null pyspark spark-dataframe

Why does isNull() behave differently in the following scenarios?

  • PySpark 1.6
  • Python 2.6.6

Definitions of the two DataFrames:

df_t1 = sqlContext.sql("select 1 id, 9 num union all select 1 id, 2 num union all select 2 id, 3 num")
df_t2 = sqlContext.sql("select 1 id, 1 start, 3 stop union all select 3 id, 1 start, 9 stop")

Scenario 1:

df_t1.join(df_t2, (df_t1.id == df_t2.id) & (df_t1.num >= df_t2.start) & (df_t1.num <= df_t2.stop), "left").select([df_t2.start, df_t2.start.isNull()]).show()

Output 1:

+-----+-------------+
|start|isnull(start)|
+-----+-------------+
| null|        false|
|    1|        false|
| null|        false|
+-----+-------------+

Scenario 2:

df_new = df_t1.join(df_t2, (df_t1.id == df_t2.id) & (df_t1.num >= df_t2.start) & (df_t1.num <= df_t2.stop), "left")
df_new.select([df_new.start, df_new.start.isNull()]).show()

Output 2:

+-----+-------------+
|start|isnull(start)|
+-----+-------------+
| null|         true|
|    1|        false|
| null|         true|
+-----+-------------+

Scenario 3:

df_t1.join(df_t2, (df_t1.id == df_t2.id) & (df_t1.num >= df_t2.start) & (df_t1.num <= df_t2.stop), "left").filter("start is null").show()

Output 3:

+---+---+----+-----+----+
| id|num|  id|start|stop|
+---+---+----+-----+----+
|  1|  9|null| null|null|
|  2|  3|null| null|null|
+---+---+----+-----+----+
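
For reference, here is the result I would expect from the left join, sketched in plain Python without Spark. This is only an illustration of the join semantics, not of how Spark evaluates the query: `left_join` is a hypothetical helper written for this post, and `None` stands in for SQL NULL.

```python
# Plain-Python sketch of the left join used above; None stands in for SQL NULL.
t1 = [(1, 9), (1, 2), (2, 3)]    # (id, num), same rows as df_t1
t2 = [(1, 1, 3), (3, 1, 9)]      # (id, start, stop), same rows as df_t2


def left_join(left, right):
    """Hypothetical helper: left join on id with start <= num <= stop."""
    rows = []
    for lid, num in left:
        matches = [(rid, start, stop) for rid, start, stop in right
                   if lid == rid and start <= num <= stop]
        if not matches:
            # No matching right-hand row: pad the right side with NULLs.
            matches = [(None, None, None)]
        for rid, start, stop in matches:
            rows.append((lid, num, rid, start, stop))
    return rows


joined = left_join(t1, t2)
# Pair each start value with a null check, like select(start, isnull(start)).
start_and_isnull = [(row[3], row[3] is None) for row in joined]
print(start_and_isnull)
```

Under these semantics every row with a null start is flagged as null, which agrees with Output 2 and Output 3 but not with Output 1.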

Thanks.

0 answers:

No answers yet