Converting a Spark DataFrame aggregation to a SQL query; questions about window, groupBy, and how to aggregate?

Asked: 2018-05-28 00:29:02

Tags: sql apache-spark

I'm messing around with the data from Spark: The Definitive Guide, and I'm using Java just to be well-rounded.

I read the data in correctly from CSV and create a temporary view table, as shown below:

Dataset<Row> staticDataFrame = spark.read().format("csv").option("header","true").option("inferSchema","true").load("/data/retail-data/by-day/*.csv");

staticDataFrame.createOrReplaceTempView("SalesInfo");

spark.sql("SELECT CustomerID, (UnitPrice * Quantity) AS total_cost, InvoiceDate from SalesInfo").show(10);

This works fine and returns the following data:

+----------+------------------+--------------------+
|CustomerID|        total_cost|         InvoiceDate|
+----------+------------------+--------------------+
|   14075.0|             85.92|2011-12-05 08:38:...|
|   14075.0|              25.0|2011-12-05 08:38:...|
|   14075.0|39.599999999999994|2011-12-05 08:38:...|
|   14075.0|              30.0|2011-12-05 08:38:...|
|   14075.0|15.299999999999999|2011-12-05 08:38:...|
|   14075.0|              40.8|2011-12-05 08:38:...|
|   14075.0|              39.6|2011-12-05 08:38:...|
|   14075.0|             40.56|2011-12-05 08:38:...|
|   18180.0|              17.0|2011-12-05 08:39:...|
|   18180.0|              17.0|2011-12-05 08:39:...|
+----------+------------------+--------------------+
only showing top 10 rows

I run into a problem when I try to group it by CustomerID, though. When I try:

spark.sql("SELECT CustomerID, (UnitPrice * Quantity) AS total_cost, InvoiceDate from SalesInfo GROUP BY CustomerID").show(10);

I get:

Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'salesinfo.`UnitPrice`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.

I get the concept of what I'm doing wrong: it doesn't know how to aggregate total_cost and InvoiceDate. But I'm still stuck on how to do that on the SQL side; the book uses Spark's aggregation functions and does it this way:

selectExpr(
    "CustomerId",
    "(UnitPrice * Quantity) as total_cost",
    "InvoiceDate")
.groupBy(
    col("CustomerId"),
    window(col("InvoiceDate"), "1 day"))
.sum("total_cost")
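
For reference, since I'm doing this in Java, a rough Java equivalent of the book's chained call (just a sketch, using col and window from org.apache.spark.sql.functions) is:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

// Same aggregation as the book's example: total cost per customer per 1-day window
staticDataFrame
    .selectExpr(
        "CustomerId",
        "(UnitPrice * Quantity) as total_cost",
        "InvoiceDate")
    .groupBy(
        col("CustomerId"),
        window(col("InvoiceDate"), "1 day"))
    .sum("total_cost")
    .show(10);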

But I'm trying to understand how to do it with a SQL statement, as a learning exercise.

Any help on how to do this with a query is appreciated.

I'm trying to figure out how to do this where I just get a total per CustomerID, but then also how to get the full functionality of the book's Spark statement, where it is the total cost per CustomerID broken out by hour.

Thanks

Edit: Here is where the data comes from; I just read it all into one dataset:

https://github.com/databricks/Spark-The-Definitive-Guide/tree/master/data/retail-data/by-day

2 Answers:

Answer 0 (score: 0):

So, here is how I'd interpret what you said (UnitPrice * Quantity per customer per hour) in SQL:

select 
    customerid, 
    sum(unitprice * quantity) as total_cost, 
    cast(cast(InvoiceDate as date) as varchar) + ' ' + cast(DATEPART(HH, InvoiceDate) as varchar) + ':00'
from [retail-data] 
group by CustomerID, cast(cast(InvoiceDate as date) as varchar) + ' ' + cast(DATEPART(HH, InvoiceDate) as varchar) + ':00'
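
The query above uses SQL Server-style date functions (DATEPART and + string concatenation). Against the SalesInfo temp view from the question, a roughly equivalent hourly rollup in Spark SQL might look like this (a sketch, assuming Spark 2.3+ for date_trunc):

// Truncate the timestamp to the hour so it can be part of the GROUP BY key
spark.sql(
    "SELECT CustomerID, date_trunc('HOUR', InvoiceDate) AS invoice_hour, " +
    "SUM(UnitPrice * Quantity) AS total_cost " +
    "FROM SalesInfo " +
    "GROUP BY CustomerID, date_trunc('HOUR', InvoiceDate)").show(10);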

Answer 1 (score: 0):

To complement Geoff Hacker's answer, you can call the explain(true) method on the DataFrame object to see the execution plan:

== Physical Plan ==
*(2) HashAggregate(keys=[CustomerId#16, window#41], functions=[sum(total_cost#26)], output=[CustomerId#16, window#41, sum(total_cost)#35])
  +- Exchange hashpartitioning(CustomerId#16, window#41, 200)
    +- *(1) HashAggregate(keys=[CustomerId#16, window#41], functions=[partial_sum(total_cost#26)], output=[CustomerId#16, window#41, sum#43])
      +- *(1) Project [named_struct(start,     precisetimestampconversion(((((CASE WHEN     (cast(CEIL((cast((precisetimestampconversion(InvoiceDate#14, TimestampType,     LongType) - 0) as double) / 8.64E10)) as double) =     (cast((precisetimestampconversion(InvoiceDate#14, TimestampType, LongType) - 0) as double) / 8.64E10)) THEN (CEIL((cast((precisetimestampconversion(InvoiceDate#14, TimestampType, LongType) - 0) as double) / 8.64E10)) + 1) ELSE CEIL((cast((precisetimestampconversion(InvoiceDate#14, TimestampType, LongType) - 0) as double) / 8.64E10)) END + 0) - 1) * 86400000000) + 0), LongType, TimestampType), end, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(InvoiceDate#14, TimestampType, LongType) - 0) as double) / 8.64E10)) as double) = (cast((precisetimestampconversion(InvoiceDate#14, TimestampType, LongType) - 0) as double) / 8.64E10)) THEN (CEIL((cast((precisetimestampconversion(InvoiceDate#14, TimestampType, LongType) - 0) as double) / 8.64E10)) + 1) ELSE CEIL((cast((precisetimestampconversion(InvoiceDate#14, TimestampType, LongType) - 0) as double) / 8.64E10)) END + 0) - 1) * 86400000000) + 86400000000), LongType, TimestampType)) AS window#41, CustomerId#16, (UnitPrice#15 * cast(Quantity#13 as double)) AS total_cost#26]
     +- *(1) Filter isnotnull(InvoiceDate#14)
        +- *(1) FileScan csv [Quantity#13,InvoiceDate#14,UnitPrice#15,CustomerID#16] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/spark/retail/2010-12-01.csv, file:/tmp/spar..., PartitionFilters: [], PushedFilters: [IsNotNull(InvoiceDate)], ReadSchema: struct<Quantity:int,InvoiceDate:timestamp,UnitPrice:double,CustomerID:double>

As you can see, Spark builds an aggregation key from CustomerId and the window (one day, 00:00:00 - 23:59:59) [HashAggregate(keys=[CustomerId#16, window#41])] and moves all rows with the same keys into a single partition (Exchange hashpartitioning). Moving data between partitions like this is called a shuffle operation. Later it executes the SUM(...) function on that accumulated data.

That said, a GROUP BY expression with 1 key should produce exactly 1 row for that key. So if, in your initial query, you define CustomerID as the key but keep total_cost and InvoiceDate in the projection, the engine has no way to produce 1 row per CustomerID, because 1 CustomerID can have multiple InvoiceDates. The SQL language makes no exception here.
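
As a sketch of the learning exercise from the question (not part of the original answer), Spark SQL also exposes the same window() function, so the book's 1-day windowed aggregation can be written directly against the SalesInfo temp view, roughly like this:

// Group by customer and a 1-day time window, summing the per-line totals
spark.sql(
    "SELECT CustomerID, window(InvoiceDate, '1 day') AS day_window, " +
    "SUM(UnitPrice * Quantity) AS total_cost " +
    "FROM SalesInfo " +
    "GROUP BY CustomerID, window(InvoiceDate, '1 day')").show(10);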