Producing summary values from a multidimensional array column

Time: 2019-07-07 07:12:56

Tags: apache-spark apache-spark-sql

The dataset contains a multidimensional array column whose entries have parent-child relationships, and it needs to be summarized.

Data (sample) -

|consumerId|    device  |       impressions        
 -----------------------------------------------------------------------------------------------------------------------------------------------------------
 |123abc    |   Phone   |  [[1123, 'container', 1, 0, [10:00 AM]],[1124, 'item', 2, 1, [10:01 AM, 10:02 AM]],[1125, 'container', 3, 0, [10:05 AM]]]
 |123abc    |   TV      |  [[1123, 'container', 1, 0, [11:00 AM]],[1128, 'item', 2, 1, [11:01 AM, 11:12 AM]],[1129, 'item', 3, 1, [11:05 AM]]]
 |123abc    |   Phone   |  [[1130, 'container', 1, 0, [12:00 AM]],[1131, 'item', 2, 1, [12:01 AM]]]

Input schema -

 |-- consumerId: string
 |-- platform: string 
 |-- impressions: array
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)                 // the same "id" serves as "containerId" or "itemId"
 |    |    |-- impressionType: string (nullable = true)     // deciding factor: "container" or "item"
 |    |    |-- impressionId: long (nullable = true)         // position; an item is matched to its container by "impressions.impressionParentId = impressions.impressionId"
 |    |    |-- impressionParentId: long (nullable = true)   // relation between a container and its items, as above
 |    |    |-- impressionTimes: array (nullable = true)     // need to take the max time
 |    |    |    |-- element: long (containsNull = true)

Output schema -

 |-- consumerId: string
 |-- platform: string 
 |-- impressions: array                
 |   |-- containerId: integer          // if impressions.impressionType = 'container', its "id" becomes containerId
 |   |-- itemImpressions: array        // list of all impression ids (as itemId) where "impressionParentId = container.impressionId"
 |   |   |-- itemId: long              // "impressions.id" if "impressions.impressionParentId = impressions.impressionId"
 |   |   |-- lastImpressionTime: long  // max(impressionTimes)
 |   |   |-- impressionCount: int    
  

A few points -

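  1. Every row contains a multidimensional array field, impressions.

  2. impressions holds both containers and their related items; the relationship can be identified by impressions.impressionParentId = impressions.impressionId [I believe it needs a self join to establish the relationship].

  3. If all records are exploded, the relationship between a container and its items cannot be established, because impressionId is only a position within each row and the same values repeat in every row.
  4. If impressions.impressionParentId = impressions.impressionId, the "id" of the matched container becomes containerId in the output schema.
  5. Three levels of aggregation -
    • first by consumerId and platform
    • then by containerId
    • then by itemId, with the count and max(impressionTimes)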
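To make the rules above concrete, here is a minimal sketch of the intended result in plain Python rather than Spark. It is not a tested Spark solution. It resolves each item's parent inside its own row, so the repeated impressionId positions never collide across rows, and then merges by (consumerId, platform), containerId and itemId. The function name `summarize`, the dict-based row layout, and the reading of impressionCount as the total number of impressionTimes entries are assumptions, not something the question states.

```python
from collections import defaultdict


def summarize(rows):
    """Aggregate rows shaped like the input schema (given as plain dicts).

    Assumption: impressionCount is the total number of impressionTimes
    entries seen for an item; the question does not define it.
    """
    # (consumerId, platform) -> containerId -> itemId -> running stats
    acc = defaultdict(lambda: defaultdict(dict))

    for row in rows:
        by_container = acc[(row["consumerId"], row["platform"])]

        # impressionId is only a position inside one row, so the parent
        # lookup is rebuilt for every row and never shared across rows.
        container_ids = {
            imp["impressionId"]: imp["id"]
            for imp in row["impressions"]
            if imp["impressionType"] == "container"
        }
        for container_id in container_ids.values():
            by_container[container_id]  # register containers without items

        for imp in row["impressions"]:
            if imp["impressionType"] != "item":
                continue
            container_id = container_ids.get(imp["impressionParentId"])
            if container_id is None:
                continue  # no matching container in this row: skipped here
            stats = by_container[container_id].setdefault(
                imp["id"], {"last": None, "count": 0}
            )
            times = imp["impressionTimes"] or []
            stats["count"] += len(times)
            if times:
                newest = max(times)
                if stats["last"] is None or newest > stats["last"]:
                    stats["last"] = newest

    # Shape the accumulator like the output schema.
    return [
        {
            "consumerId": consumer_id,
            "platform": platform,
            "impressions": [
                {
                    "containerId": container_id,
                    "itemImpressions": [
                        {
                            "itemId": item_id,
                            "lastImpressionTime": stats["last"],
                            "impressionCount": stats["count"],
                        }
                        for item_id, stats in items.items()
                    ],
                }
                for container_id, items in by_container.items()
            ],
        }
        for (consumer_id, platform), by_container in acc.items()
    ]
```

The sketch keeps "id" values as they arrive (strings in the input schema) instead of casting them to the integer and long types shown in the output schema. In Spark, the same per-row pairing could plausibly be kept by tagging each row with a unique id before exploding, so that the join between items and containers is restricted to one original row.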

The sample above comes from an already filtered dataset. I tried exploding and then aggregating, but could not get it to work.

0 answers:

No answers yet