MySQL sum over a window containing nulls returns null

Date: 2019-01-18 16:18:28

Tags: apache-spark null apache-spark-sql window-functions

I'm trying to get, for each client, the total revenue of the rows from the previous 3 months (not including the current row). Minimal example of what I'm currently trying in Databricks:

import numpy as np
import pandas as pd

cols = ['Client','Month','Revenue']
df_pd = pd.DataFrame([['A',201701,100],
                   ['A',201702,101],
                   ['A',201703,102],
                   ['A',201704,103],
                   ['A',201705,104],
                   ['B',201701,201],
                   ['B',201702,np.nan],
                   ['B',201703,203],
                   ['B',201704,204],
                   ['B',201705,205],
                   ['B',201706,206],
                   ['B',201707,207]                
                  ])
df_pd.columns = cols

spark_df = spark.createDataFrame(df_pd)
spark_df.createOrReplaceTempView('df_sql')

df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
  order by Client,Month
  rows between 3 preceding and 1 preceding)) as Total_Sum3
  from df_sql
  """)
df_out.show()

+------+------+-------+----------+
|Client| Month|Revenue|Total_Sum3|
+------+------+-------+----------+
|     A|201701|  100.0|      null|
|     A|201702|  101.0|     100.0|
|     A|201703|  102.0|     201.0|
|     A|201704|  103.0|     303.0|
|     A|201705|  104.0|     306.0|
|     B|201701|  201.0|      null|
|     B|201702|    NaN|     201.0|
|     B|201703|  203.0|       NaN|
|     B|201704|  204.0|       NaN|
|     B|201705|  205.0|       NaN|
|     B|201706|  206.0|     612.0|
|     B|201707|  207.0|     615.0|
+------+------+-------+----------+

As you can see, if a null exists anywhere in the 3-month window, a null is returned. I would like to treat nulls as 0, so I tried ifnull, but that doesn't seem to work. I also tried a case statement to change NULL to 0, with no luck.
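For reference, the intended "treat missing revenue as 0" behavior can be sketched in plain pandas on a slice of the same sample data (a hypothetical illustration, not the Spark fix): fill the NaNs before applying the rolling window, then shift by one to exclude the current row.

```python
import numpy as np
import pandas as pd

# Sample slice of client B's data, with one missing month.
df = pd.DataFrame({'Client': ['B'] * 4,
                   'Month': [201701, 201702, 201703, 201704],
                   'Revenue': [201.0, np.nan, 203.0, 204.0]})

# Sum of up to 3 preceding rows per client, excluding the current row,
# with NaN counted as 0 (fillna happens before the rolling window).
df['Total_Sum3'] = (df.groupby('Client')['Revenue']
                      .transform(lambda s: s.fillna(0)
                                            .rolling(3, min_periods=1)
                                            .sum()
                                            .shift(1)))
print(df)
```

The first row per client has an empty window and therefore stays NaN, matching the null in the first row of the Spark output above.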

2 Answers:

Answer 0 (score: 0)

Just coalesce outside the sum:

df_out = sqlContext.sql("""
  select *, coalesce(sum(Revenue) over (partition by Client
  order by Client,Month
  rows between 3 preceding and 1 preceding), 0) as Total_Sum3
  from df_sql
 """)

Answer 1 (score: 0)

This is Apache Spark, my bad! (I work in Databricks and I assumed it was MySQL under the hood.) Is it too late to change the title?

@Barmar, you were right in that IFNULL() does not treat NaN as null. Thanks to @user6910411 and this SO link, I managed to figure out the fix: I had to convert numpy's NaN values into proper Spark nulls.

The correct code after creating the sample df_pd, which then gives the desired result:

spark_df = spark.createDataFrame(df_pd)

from pyspark.sql.functions import isnan, col, when

#this converts all NaNs in numeric columns to null:
spark_df = spark_df.select([
    when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c 
    for c, t in spark_df.dtypes])

spark_df.createOrReplaceTempView('df_sql')

df_out = sqlContext.sql("""
select *, (sum(ifnull(Revenue,0)) over (partition by Client
  order by Client,Month
  rows between 3 preceding and 1 preceding)) as Total_Sum3
  from df_sql order by Client,Month
  """)
df_out.show()

Is sqlContext the best way to tackle this, or would it be better/more elegant to achieve the same result via pyspark.sql.window?