I'm very new to PySpark, so thanks in advance for any help. I have a DataFrame built like this:
import pandas as pd

test = {}
test["1"] = {"vars": ["x1", "x2"]}
test["2"] = {"vars": ["x2"]}
test["3"] = {"vars": ["x3"]}
test["4"] = {"vars": ["x2", "x3"]}
pdDF = pd.DataFrame(test).transpose()
sparkDF = spark.createDataFrame(pdDF)
+--------+
| vars|
+--------+
|[x1, x2]|
| [x2]|
| [x3]|
|[x2, x3]|
+--------+
I'm looking for a way to group by the individual values inside the "vars" lists and count them, i.e. I want the following result:
+-----+---+
|count|var|
+-----+---+
| 1| x1|
| 3| x2|
| 2| x3|
+-----+---+
Can anyone suggest how to achieve this?
Thanks!
Answer 0 (score: 2)
from pyspark.sql.functions import explode

values = [(["x1", "x2"],), (["x2"],), (["x3"],), (["x2", "x3"],)]
df = spark.createDataFrame(values, ['vars'])
df.show()
+--------+
| vars|
+--------+
|[x1, x2]|
| [x2]|
| [x3]|
|[x2, x3]|
+--------+
# explode() turns each element of the array into its own row,
# so an ordinary groupBy/count then gives the per-variable totals
newdf = df.withColumn("vars2", explode(df.vars))
newdf.groupBy('vars2').count().show()
+-----+-----+
|vars2|count|
+-----+-----+
| x2| 3|
| x3| 2|
| x1| 1|
+-----+-----+
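To match the column names and order asked for in the question, you can chain PySpark's `withColumnRenamed` and `orderBy` onto the grouped result. As a quick cross-check of the counts themselves, the same explode-then-count logic can be sketched in plain pandas (using `Series.explode` and `value_counts`, available in pandas 0.25+) on the original `test` dict:

```python
import pandas as pd

test = {
    "1": {"vars": ["x1", "x2"]},
    "2": {"vars": ["x2"]},
    "3": {"vars": ["x3"]},
    "4": {"vars": ["x2", "x3"]},
}
pdDF = pd.DataFrame(test).transpose()

# Flatten each list into one row per element, then count occurrences
counts = pdDF["vars"].explode().value_counts()
print(counts.to_dict())  # {'x2': 3, 'x3': 2, 'x1': 1}
```

This confirms the totals (x1: 1, x2: 3, x3: 2) without needing a Spark session, which can be handy when iterating on small samples.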