PySpark - group by array

Time: 2019-10-10 08:28:50

Tags: python apache-spark pyspark

I am very new to PySpark. Any help is appreciated. I have a dataframe:

import pandas as pd

test = {}
test["1"] = {"vars": ["x1", "x2"]}
test["2"] = {"vars": ["x2"]}
test["3"] = {"vars": ["x3"]}
test["4"] = {"vars": ["x2", "x3"]}
pdDF = pd.DataFrame(test).transpose()
sparkDF = spark.createDataFrame(pdDF)

+--------+
|    vars|
+--------+
|[x1, x2]|
|    [x2]|
|    [x3]|
|[x2, x3]|
+--------+

I am looking for a way to group by the values inside the "vars" column's lists and count them. The result I am after is:


+-----+---+
|count|var|
+-----+---+
|    1| x1|
|    3| x2|
|    2| x3|
+-----+---+

Can someone suggest how to achieve this?

Thanks!

1 answer:

Answer 0 (score: 2):

from pyspark.sql.functions import explode

# sqlContext works on older Spark versions; on Spark 2+,
# spark.createDataFrame can be used the same way.
values = [(["x1", "x2"],), (["x2"],), (["x3"],), (["x2", "x3"],)]
df = sqlContext.createDataFrame(values, ['vars'])
df.show()

+--------+
|    vars|
+--------+
|[x1, x2]|
|    [x2]|
|    [x3]|
|[x2, x3]|
+--------+

newdf = df.withColumn("vars2", explode(df.vars))  # one row per array element
newdf.groupBy('vars2').count().show()

+-----+-----+
|vars2|count|
+-----+-----+
|   x2|    3|
|   x3|    2|
|   x1|    1|
+-----+-----+