I want to get percentage frequencies in PySpark. I did this in Python as follows.
Getting the frequencies is pretty simple:
Companies = df['Company'].value_counts(normalize = True)
# Companies in descending order of complaint frequency
df.createOrReplaceTempView('Comp')
CompDF = spark.sql("SELECT Company, count(*) as cnt \
FROM Comp \
GROUP BY Company \
ORDER BY cnt DESC")
CompDF.show()
How do I get percentage frequencies from here? I have tried a bunch of things without much luck. Any help would be appreciated.
Answer 0 (score: 2)
As Suresh hinted at in the comments, assuming total_count is the number of rows in the dataframe Companies, you can use withColumn to add a new column named percentage to CompDF:
total_count = Companies.count()  # total number of rows
df = CompDF.withColumn('percentage', CompDF.cnt / float(total_count))
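If you would rather avoid the separate count() action, a window aggregate can compute each company's share in a single pass. This is a minimal alternative sketch, not part of the original answer; it assumes CompDF is the dataframe with the cnt column produced by the query in the question:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# A window with no partitioning spans the whole dataframe,
# so sum('cnt') over it is the grand total of all counts
w = Window.partitionBy()
CompDF = CompDF.withColumn('percentage', F.col('cnt') / F.sum('cnt').over(w))
CompDF.show()

Since the window covers the entire dataframe, this yields the same ratio as cnt / total_count above.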
Answer 1 (score: 0)
Perhaps modifying the SQL query will get you the result you want:
"SELECT Company,cnt/(SELECT SUM(cnt) from (SELECT Company, count(*) as cnt
FROM Comp GROUP BY Company ORDER BY cnt DESC) temp_tab) sum_freq from
(SELECT Company, count(*) as cnt FROM Comp GROUP BY Company ORDER BY cnt
DESC)"