How to get percentage frequencies in pyspark

Date: 2017-10-04 22:02:11

Tags: pyspark apache-spark-sql pyspark-sql

I want to get percentage frequencies in pyspark; in python (pandas) I did it as follows.


Getting the frequencies is simple enough:

Companies = df['Company'].value_counts(normalize = True)
In pyspark I have the counts so far:

# Companies in descending order of complaint frequency
df.createOrReplaceTempView('Comp')
CompDF = spark.sql("SELECT Company, count(*) as cnt \
                    FROM Comp \
                    GROUP BY Company \
                    ORDER BY cnt DESC")
CompDF.show()

How do I get the percentage frequencies from here? I have tried a bunch of things without much luck. Any help would be appreciated.

2 Answers:

Answer 0 (score: 2)

As Suresh hinted in the comments, assuming total_count is the total number of rows in the original dataframe, you can use withColumn to add a new column called percentage to CompDF:

# total_count should be the total number of rows in the original dataframe
# (the one the Comp view was created from), not the number of distinct companies
total_count = df.count()

df = CompDF.withColumn('percentage', CompDF.cnt / float(total_count))
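
A quick way to sanity-check the result (a minimal usage sketch, reusing the names from the snippet above; the percentages should sum to roughly 1.0):

from pyspark.sql import functions as F

# Largest shares first, then verify the column sums to ~1.0
df.orderBy(F.desc('percentage')).show()
df.agg(F.sum('percentage')).show()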

Answer 1 (score: 0)

Maybe modifying the SQL query will get you the result you want:

    "SELECT Company,cnt/(SELECT SUM(cnt) from (SELECT Company, count(*) as cnt 
    FROM Comp GROUP BY Company ORDER BY cnt DESC) temp_tab) sum_freq from 
    (SELECT Company, count(*) as cnt FROM Comp GROUP BY Company ORDER BY cnt 
    DESC)"