I'm sorry if my question is unclear; I'm not very good with queries. I think it will be much easier to explain what I'm trying to achieve with an example.
# load dataframes from CSV files
from pyspark.sql.functions import lit

cores = spark.read.format("csv").option("header", "true").load(coreFile)
children = spark.read.format("csv").option("header", "true").load(childFile)

# get all distinct attribute types (the column's distinct values)
childTypes = children.select("AttributeType").distinct().collect()

# create a new column for each child type, defaulting to 0
redimDF = cores
for childType in childTypes:
    redimDF = redimDF.withColumn(childType['AttributeType'], lit(0))
I have two dataframes in a Databricks cluster.
The first, 'redimDF':
+---+-----+-----+-------+-----+--+-----+-----+-------+------+-------+
|PId|SCode|PCode|LOYALTY|OFFER|VF|VENUE|GROUP|MISSION|REGION|GENERIC|
+---+-----+-----+-------+-----+--+-----+-----+-------+------+-------+
|663|  770|   30|      0|    0| 0|    0|    0|      0|     0|      0|
|527|  786|   32|      0|    0| 0|    0|    0|      0|     0|      0|
+---+-----+-----+-------+-----+--+-----+-----+-------+------+-------+
The second, 'children':
+---+--------------+-------+
|PId| AttributeType|  Value|
+---+--------------+-------+
|663|        REGION|      6|
|663|       LOYALTY|      0|
|663|         OFFER|   0000|
|663|       MISSION|      D|
|663|            VF|     77|
|663|         VENUE|  20744|
|527|        REGION|      4|
|527|       LOYALTY|      0|
+---+--------------+-------+
I would like the result to look like this:
+---+-----+-----+-------+-----+--+-----+-----+-------+------+-------+
|PId|SCode|PCode|LOYALTY|OFFER|VF|VENUE|GROUP|MISSION|REGION|GENERIC|
+---+-----+-----+-------+-----+--+-----+-----+-------+------+-------+
|663|  770|   30|      0| 0000|77|20744|    0|      D|     6|      0|
|527|  786|   32|      0|    0| 0|    0|    0|      0|     4|      0|
+---+-----+-----+-------+-----+--+-----+-----+-------+------+-------+
Is there a way to achieve this with a PySpark query?
Thanks in advance.
Answer 0 (score: 0)
Here is one approach using pivot.
Create the required dataframes:
import pyspark.sql.functions as F
redim = [(663,770, 30, 0, 0, 0), (527,786, 32, 0 ,0 ,0)]
redimDF = sqlContext.createDataFrame(redim, ["PId","SCode","PCode","LOYALTY","OFFER","VF"])
redimDF.show()
+---+-----+-----+-------+-----+---+
|PId|SCode|PCode|LOYALTY|OFFER| VF|
+---+-----+-----+-------+-----+---+
|663|  770|   30|      0|    0|  0|
|527|  786|   32|      0|    0|  0|
+---+-----+-----+-------+-----+---+
children = [(663,"LOYALTY",40),(663,"OFFER", 20),(527,"LOYALTY",40),(527,"VF", 20)]
childrenDF = sqlContext.createDataFrame(children, ["PId","AttributeType","Value"])
childrenDF.show()
+---+-------------+-----+
|PId|AttributeType|Value|
+---+-------------+-----+
|663|      LOYALTY|   40|
|663|        OFFER|   20|
|527|      LOYALTY|   40|
|527|           VF|   20|
+---+-------------+-----+
Pivot childrenDF, and for every attribute type of redimDF that is not present in childrenDF, add it as a column set to 0:
childrenDF = childrenDF.groupBy("PId").pivot("AttributeType").agg(F.sum(F.col("Value")))

for col in redimDF.columns:
    if col not in childrenDF.columns:
        childrenDF = childrenDF.withColumn(col, F.lit(0))
Select the columns in the same order as redimDF, then union the two frames:
childrenDF = childrenDF.select(redimDF.columns)
df = redimDF.union(childrenDF)
Group by and sum to get the result df:
df = df.groupBy("PId").agg(F.sum("SCode").alias("SCode"),
                           F.sum("PCode").alias("PCode"),
                           F.sum("LOYALTY").alias("LOYALTY"),
                           F.sum("OFFER").alias("OFFER"),
                           F.sum("VF").alias("VF"))
df.show()
+---+-----+-----+-------+-----+---+
|PId|SCode|PCode|LOYALTY|OFFER| VF|
+---+-----+-----+-------+-----+---+
|663|  770|   30|     40|   20|  0|
|527|  786|   32|     40|    0| 20|
+---+-----+-----+-------+-----+---+