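I have the following code snippet that I use to create a chart. I want to modify it to work in PySpark, but I don't know how to proceed. The problem is that I can't iterate over a column in PySpark, and I have not gotten as far as turning it into a function.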
Context: the DataFrame has a column named City, which is just the city name as a string.
cities = [i.City for i in df.select('City').distinct().collect()]
stack = []
for city in cities:
    df = sqlContext.sql('SELECT `Complaint Type`, COUNT(*) as `counts` '
                        'FROM c311 '
                        'WHERE City = "{}" COLLATE NOCASE '
                        'GROUP BY `Complaint Type` '
                        'ORDER BY counts DESC'.format(city))
    stack.append(Bar(x=df['Complaint Type'], y=df.counts, name=city.capitalize()))

My goal is to send this through toPandas() and plot the chart locally, but I am running into errors. How do I handle this in PySpark?
Answer 0 (score: 1)
You can:
from pyspark.sql.functions import upper, col
pdf = df.withColumn("city", upper(col("city"))) \
.groupBy("Complaint Type").pivot("city").count() \
.toPandas()
(or group by city and pivot by type) and work from there.
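
To make "work from there" a bit more concrete, here is a minimal sketch of building one bar trace per city from a single aggregated result instead of running one query per city. It assumes the rows come from something like df.groupBy("City", "Complaint Type").count().collect(), which yields (city, complaint type, count) tuples. The helper name build_traces and the sample rows are made up for illustration; they are not from the question.

```python
from collections import defaultdict


def build_traces(rows):
    """Group (city, complaint_type, count) rows into one bar-trace spec per city.

    `rows` is assumed to look like the collected output of a Spark
    groupBy("City", "Complaint Type").count(); any iterable of 3-tuples works.
    """
    by_city = defaultdict(lambda: defaultdict(int))
    for city, complaint_type, count in rows:
        if city is None:
            # Skip rows with no city; a real dataset may want to keep them.
            continue
        # Upper-case the city so "BROOKLYN" and "Brooklyn" are merged,
        # which is what COLLATE NOCASE was trying to do in the question.
        by_city[city.upper()][complaint_type] += count

    traces = []
    for city in sorted(by_city):
        # Largest counts first, mirroring ORDER BY counts DESC.
        pairs = sorted(by_city[city].items(), key=lambda p: p[1], reverse=True)
        traces.append({
            "x": [complaint_type for complaint_type, _ in pairs],
            "y": [count for _, count in pairs],
            "name": city.capitalize(),
        })
    return traces


# Made-up sample rows standing in for the collected Spark result.
sample_rows = [
    ("BROOKLYN", "Noise", 5),
    ("Brooklyn", "Heating", 9),
    ("BRONX", "Noise", 3),
    (None, "Noise", 1),
]
traces = build_traces(sample_rows)
```

If Bar in the question is plotly's Bar (an assumption on my part), each dict can be passed along as keyword arguments, e.g. stack = [Bar(**t) for t in traces]. The point of the design is that Spark does the aggregation once and only the small summarized result reaches the driver.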