I'm working with pyspark dataframes and would like to create a nested structure like the one below.
Before:
Column 1 | Column 2 | Column 3
--------------------------------
A | B | 1
A | B | 2
A | C | 1
After:
Column 1 | Column 4
--------------------------------
A | [B : [1,2]]
A | [C : [1]]
Is this possible?
Answer 0 (score: 1)
I don't think you can get exactly that output, but you can come close. The problem is the key name in Column 4: in Spark, a struct needs a fixed set of fields that are known in advance. But let's leave that for later; first, the aggregation:
import pyspark
from pyspark.sql import functions as F
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
data = [('A', 'B', 1), ('A', 'B', 2), ('A', 'C', 1)]
columns = ['Column1', 'Column2', 'Column3']
data = spark.createDataFrame(data, columns)
data.createOrReplaceTempView("data")
data.show()
# Result
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+-------+-------+-------+
nested = spark.sql("SELECT Column1, Column2, STRUCT(COLLECT_LIST(Column3) AS data) AS Column4 FROM data GROUP BY Column1, Column2")
nested.toJSON().collect()
# Result
['{"Column1":"A","Column2":"C","Column4":{"data":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"data":[1,2]}}']
That's almost what you wanted, right? The problem is that Spark cannot determine the schema of your data if the key names (i.e. the values in Column 2) aren't known in advance. Also, I'm not sure you can use a column's value as a struct key unless you use a UDF (perhaps combined with a PIVOT?):
datatype = 'struct<B:array<bigint>,C:array<bigint>>'  # Add any other potential keys here.

@F.udf(datatype)
def replace_struct_name(column2_value, column4_value):
    # Re-key the struct using the value from Column2.
    return {column2_value: column4_value['data']}

nested.withColumn('Column4', replace_struct_name(F.col("Column2"), F.col("Column4"))).toJSON().collect()
# Output
['{"Column1":"A","Column2":"C","Column4":{"C":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"B":[1,2]}}']
Of course, the downside is that the set of keys must be finite and known in advance; otherwise, values for any other key are silently dropped.
Answer 1 (score: 0)
First, a reproducible example of your dataframe.
from pyspark.sql import SQLContext, functions as F

js = [{"col1": "A", "col2": "B", "col3": 1},
      {"col1": "A", "col2": "B", "col3": 2},
      {"col1": "A", "col2": "C", "col3": 1}]
jsrdd = sc.parallelize(js)
sqlContext = SQLContext(sc)
jsdf = sqlContext.read.json(jsrdd)
jsdf.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+----+----+----+
Now, the lists are not stored as key-value pairs. Once you have grouped on col1 and col2, you can either build a dictionary yourself or simply use collect_list().
jsdf.groupby(['col1', 'col2']).agg(F.collect_list('col3')).show()
+----+----+------------------+
|col1|col2|collect_list(col3)|
+----+----+------------------+
| A| C| [1]|
| A| B| [1, 2]|
+----+----+------------------+