Modify Pandas code for a PySpark DataFrame

Date: 2016-12-12 21:54:09

Tags: apache-spark pyspark pyspark-sql

I have the following code snippet for building charts. I want to modify it to work in PySpark, but I'm not sure how to proceed. The problem is that I can't iterate over a column in PySpark, and I haven't tried turning it into a function yet.

Context: the DataFrame has a column named City, which is just the city name as a string:

City

My goal is to take this:

cities = [i.City for i in df.select('City').distinct().collect()]

stack = []
for city in cities:
    df = sqlContext.sql(
        'SELECT `Complaint Type`, COUNT(*) as `counts` '
        'FROM c311 '
        'WHERE City = "{}" COLLATE NOCASE '
        'GROUP BY `Complaint Type` '
        'ORDER BY counts DESC'.format(city))
    stack.append(Bar(x=df['Complaint Type'], y=df.counts, name=city.capitalize()))

and plot the charts locally. But I have been running into errors ever since toPandas(). How should I approach this in PySpark?
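For illustration only (not from the original post): a minimal sketch of the same per-city loop written against the DataFrame API rather than raw SQL, assuming the data is already loaded as df with columns City and Complaint Type, and that plotly's Bar builds the traces as in the snippet above. Each per-city aggregate is small, so toPandas() is only called on the reduced result:

from pyspark.sql.functions import upper, col
from plotly.graph_objs import Bar

cities = [row.City for row in df.select('City').distinct().collect()]

stack = []
for city in cities:
    # Aggregate a single city; the reduced result is small enough to
    # bring back to the driver with toPandas().
    counts = (df
              .where(upper(col('City')) == city.upper())
              .groupBy('Complaint Type')
              .count()
              .orderBy('count', ascending=False)
              .toPandas())
    stack.append(Bar(x=counts['Complaint Type'],
                     y=counts['count'],
                     name=city.capitalize()))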

1 Answer:

Answer 0 (score: 1):

You can:

from pyspark.sql.functions import upper, col

# Normalize the city names, count complaints per (type, city), pivot the
# cities into columns, then collect the small result to pandas.
pdf = df.withColumn("city", upper(col("city"))) \
    .groupBy("Complaint Type").pivot("city").count() \
    .toPandas()

(or group by city and pivot by type) and work with the result from there.
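As a follow-up sketch (not part of the original answer): the pivoted pandas DataFrame has one row per complaint type and one column per city, so the list of Bar traces from the question could be rebuilt from it roughly like this, again assuming plotly's Bar:

from plotly.graph_objs import Bar

# Missing (type, city) combinations come back as NaN after the pivot.
pdf = pdf.fillna(0).set_index("Complaint Type")

stack = [Bar(x=pdf.index, y=pdf[city], name=city.capitalize())
         for city in pdf.columns]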