I have one large nested JSON document per year (e.g. 2018, 2017), which aggregates data by month (Jan to Dec) and by day (1 to 31):
{
  "2018": {
    "Jan": {
      "1": { "u": 1, "n": 2 },
      "2": { "u": 4, "n": 7 }
    },
    "Feb": {
      "1": { "u": 3, "n": 2 },
      "4": { "u": 4, "n": 5 }
    }
  }
}
I have used the AWS Glue Relationalize.apply function to convert the hierarchical data above into a flat structure:
dfc = Relationalize.apply(frame=datasource0, staging_path=my_temp_bucket, name=my_ref_relationalize_table, transformation_ctx="dfc")
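For context, a minimal sketch of the surrounding job setup (the database and table names here are placeholders, not from my actual job; Relationalize returns a DynamicFrameCollection, from which the flattened frame is selected by the name passed in):

from awsglue.transforms import Relationalize

# hypothetical catalog source; my_database / my_table are placeholder names
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table")

dfc = Relationalize.apply(frame=datasource0, staging_path=my_temp_bucket,
                          name=my_ref_relationalize_table,
                          transformation_ctx="dfc")

# Relationalize returns a DynamicFrameCollection; pick the flattened
# root table by the name passed above
flattened = dfc.select(my_ref_relationalize_table)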
That call gives me a table with one column per JSON leaf, like this:
| 2018.Jan.1.u | 2018.Jan.1.n | 2018.Jan.2.u | 2018.Jan.2.n | 2018.Feb.1.u | 2018.Feb.1.n | 2018.Feb.4.u | 2018.Feb.4.n |
| 1 | 2 | 4 | 7 | 3 | 2 | 4 | 5 |
As you can see, the table has a column for every day of every month, which makes it very wide. I would like to simplify it by turning those columns into rows:
| year | month | dd | u | n |
| 2018 | Jan | 1 | 1 | 2 |
| 2018 | Jan | 2 | 4 | 7 |
| 2018 | Feb | 1 | 3 | 2 |
| 2018 | Feb | 4 | 4 | 5 |
I have not been able to find the right answer by searching. Is there a solution in AWS Glue / PySpark, or any other way, to perform this unpivot and get a row-based table from the column-based one? Can it be done in Athena?
Answer 0: (score: 0)
I implemented a solution similar to the following snippet:
dataFrame = datasource0.toDF()

tableDataArray = []  ## to hold rows as [year, month, dd, u, n]
for row in dataFrame.rdd.toLocalIterator():
    # group the flat columns ('2018.Jan.1.u', '2018.Jan.1.n', ...) by day,
    # so the u and n values for the same day end up in one output row
    metricsByDay = {}
    for colName in dataFrame.schema.names:
        value = row[colName]
        keyArray = colName.split('.')  # [year, month, dd, metric]
        dayKey = (keyArray[0], keyArray[1], keyArray[2])
        metricsByDay.setdefault(dayKey, {})[keyArray[3]] = value
    for (year, month, dd), metrics in metricsByDay.items():
        tableDataArray.append([year, month, dd, str(metrics.get('u')), str(metrics.get('n'))])
from pyspark.sql import Row

unpivotDF = None
for rowDataArray in tableDataArray:
    newRowDF = sc.parallelize([Row(year=rowDataArray[0], month=rowDataArray[1], dd=rowDataArray[2], u=rowDataArray[3], n=rowDataArray[4])]).toDF()
    if unpivotDF is None:
        unpivotDF = newRowDF
    else:
        unpivotDF = unpivotDF.union(newRowDF)

datasource0 = datasource0.fromDF(unpivotDF, glueContext, "datasource0")
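Note that unioning one single-row DataFrame per day gets slow as the table grows; the whole list can be turned into a DataFrame in one call instead (a sketch under the same variable names):

# one-shot alternative to the union loop above
unpivotDF = spark.createDataFrame(
    [Row(year=r[0], month=r[1], dd=r[2], u=r[3], n=r[4]) for r in tableDataArray])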
Alternatively, start from an explicitly typed empty DataFrame and append each row to it:

from pyspark.sql.types import StructType, StructField, StringType

# all values were stringified above, so StringType throughout
columns = [StructField(name, StringType(), True) for name in ['year', 'month', 'dd', 'u', 'n']]
schema = StructType(columns)
unpivotDF = sqlContext.createDataFrame(sc.emptyRDD(), schema)
for rowDataArray in tableDataArray:
    newRowDF = spark.createDataFrame([rowDataArray], schema)
    unpivotDF = unpivotDF.union(newRowDF)
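For completeness, the unpivot can also be done without collecting rows to the driver at all, using Spark SQL's stack() function directly on the wide DataFrame. This is a sketch assuming the dotted column layout shown in the question, with a .u and an .n column present for every day; df, prefixes, and stack_expr are illustrative names:

df = datasource0.toDF()

# distinct (year, month, dd) prefixes found in the column names
prefixes = sorted({tuple(c.split('.')[:3]) for c in df.columns})

# stack(n, k1a, ..., k1e, ..., kna, ..., kne) emits n rows; here each row is
# (year, month, dd, u, n), with backticks because the column names contain dots
stack_expr = "stack({}, {}) as (year, month, dd, u, n)".format(
    len(prefixes),
    ", ".join(
        "'{0}', '{1}', '{2}', `{0}.{1}.{2}.u`, `{0}.{1}.{2}.n`".format(y, m, d)
        for (y, m, d) in prefixes))

unpivotDF = df.selectExpr(stack_expr)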