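The dataset contains a multidimensional array column whose elements have a parent-child relationship, and I need to aggregate it.
Data (sample) -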
|consumerId| device | impressions
-----------------------------------------------------------------------------------------------------------------------------------------------------------
|123abc | Phone | [[1123, 'container', 1, 0, [10:00 AM]],[1124, 'item', 2, 1, [10:01 AM, 10:02 AM]],[1125, 'container', 3, 0, [10:05 AM]]]
|123abc | TV | [[1123, 'container', 1, 0, [11:00 AM]],[1128, 'item', 2, 1, [11:01 AM, 11:12 AM]],[1129, 'item', 3, 1, [11:05 AM]]]
|123abc | Phone | [[1130, 'container', 1, 0, [12:00 AM]],[1131, 'item', 2, 1, [12:01 AM]]]
Input schema -
|-- consumerId: string
|-- platform: string
|-- impressions: array
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true) // the same "id" field serves as either "containerId" or "itemId"
| | |-- impressionType: string (nullable = true) // deciding factor: "container" or "item"
| | |-- impressionId: long (nullable = true) // position; an item is matched to its container by "impressions.impressionParentId = impressions.impressionId"
| | |-- impressionParentId: long (nullable = true) // relation between a container and its items, as above
| | |-- impressionTimes: array (nullable = true) // need to take the max time
| | | |-- element: long (containsNull = true)
Output schema -
|-- consumerId: string
|-- platform: string
|-- impressions: array
| |-- containerId: integer // if impressions.impressionType = 'container', its "id" becomes containerId
| |-- itemImpressions: array // list of all impression ids (as itemId) where "impressionParentId = container.impressionId"
| | |-- itemId: long // "impressions.id" if "impressions.impressionParentId = impressions.impressionId"
| | |-- lastImpressionTime: long // max(impressionTimes)
| | |-- impressionCount: int
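A few points -
Each row contains a multidimensional array field, impressions.
impressions holds both containers and their related items. The relationship can be identified by impressions.impressionParentId = impressions.impressionId [I believe a self-join is needed to establish it].
If impressions.impressionType = 'container', then its "id" becomes containerId in the output schema.
The sample above comes from an already-filtered dataset. I tried exploding and aggregating, but could not get it to work.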
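
To make the intended result concrete, here is a minimal sketch of the per-row logic in plain Python rather than any particular DataFrame API. It rests on several assumptions that are not stated in the question: each impression is represented as a dict with the field names from the input schema, impressionCount means the number of entries in impressionTimes, and the numeric times are made-up stand-ins for the clock times in the sample. The function name `aggregate_impressions` is hypothetical.

```python
from collections import defaultdict


def aggregate_impressions(impressions):
    """Group the item impressions of one row under their parent container."""
    # Index items by the impressionId of the container they point to.
    items_by_parent = defaultdict(list)
    for imp in impressions:
        if imp["impressionType"] == "item":
            items_by_parent[imp["impressionParentId"]].append(imp)

    result = []
    for imp in impressions:
        if imp["impressionType"] != "container":
            continue
        children = items_by_parent.get(imp["impressionId"], [])
        result.append({
            "containerId": imp["id"],
            "itemImpressions": [
                {
                    "itemId": child["id"],
                    # max(...) with a default avoids an error on an empty list.
                    "lastImpressionTime": max(child["impressionTimes"], default=None),
                    # Assumption: the count is the number of recorded times.
                    "impressionCount": len(child["impressionTimes"]),
                }
                for child in children
            ],
        })
    return result


# First sample row, with made-up long values standing in for the clock times.
sample_row = [
    {"id": "1123", "impressionType": "container", "impressionId": 1,
     "impressionParentId": 0, "impressionTimes": [1000]},
    {"id": "1124", "impressionType": "item", "impressionId": 2,
     "impressionParentId": 1, "impressionTimes": [1001, 1002]},
    {"id": "1125", "impressionType": "container", "impressionId": 3,
     "impressionParentId": 0, "impressionTimes": [1005]},
]

result = aggregate_impressions(sample_row)
print(result)
```

Under these assumptions, container 1123 ends up with item 1124 (last time 1002, count 2), and container 1125 gets an empty itemImpressions list. Because the matching happens inside a single row, no join across rows is involved in this sketch; whether that holds for the real data depends on whether a container and its items always share a row.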