Get counts for grouped columns

Date: 2019-05-06 18:02:08

Tags: python apache-spark pyspark

I have a dataframe containing 3 networks, each of which has a number of stations. What I want to do is get the total number of stations for each network. The dataframe should still contain the network and station names, so it should look like this:

Network Station Total
XMN     DIS     3     
XMN     CNN     3
XMN     JFK     3
ALK     DIS     2
ALK     CNN     2

How would I go about this? I assume I need to group the columns and then use a window function partitioned by network and station to get the total, but I'm not sure. How would I do that?

2 answers:

Answer 0 (score: 1):

Window.partitionBy does exactly this:

from pyspark.sql import Row, SparkSession, Window
from pyspark.sql.functions import count

spark_session = SparkSession.builder.getOrCreate()

df = spark_session.createDataFrame([
    Row(Network='XMN', Station='DIS'),
    Row(Network='XMN', Station='CNN'),
    Row(Network='XMN', Station='JFK'),
    Row(Network='ALK', Station='DIS'),
    Row(Network='ALK', Station='CNN')
])

# Count rows per Network with a window function, keeping every original row
df.select(
    "Network", "Station",
    count("*").over(Window.partitionBy("Network")).alias("Total")
).show()

Output:

+-------+-------+-----+
|Network|Station|Total|
+-------+-------+-----+
|    XMN|    DIS|    3|
|    XMN|    CNN|    3|
|    XMN|    JFK|    3|
|    ALK|    DIS|    2|
|    ALK|    CNN|    2|
+-------+-------+-----+
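
Equivalently, a minimal variant of the same approach appends the window count with withColumn, which keeps all existing columns without listing them in select (reusing df and the imports from above):

# Append the per-Network count as a new column; every original column is kept
df.withColumn("Total", count("*").over(Window.partitionBy("Network"))).show()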

Answer 1 (score: 0):

You need to group, get the counts, and join them back to the original dataframe:

scala> val df = Seq(("XMN", "DIS"), ("XMN", "CNN"), ("XMN", "JFK"), ("ALK", "DIS"), ("ALK", "CNN")).toDF("Network", "Station")
df: org.apache.spark.sql.DataFrame = [Network: string, Station: string]

scala> df.show
+-------+-------+
|Network|Station|
+-------+-------+
|    XMN|    DIS|
|    XMN|    CNN|
|    XMN|    JFK|
|    ALK|    DIS|
|    ALK|    CNN|
+-------+-------+


scala> val grpCountDF = df.groupBy("Network").count
grpCountDF: org.apache.spark.sql.DataFrame = [Network: string, count: bigint]

scala> grpCountDF.show
+-------+-----+
|Network|count|
+-------+-----+
|    XMN|    3|
|    ALK|    2|
+-------+-----+


scala> val outputDF = df.join(grpCountDF, "Network")
outputDF: org.apache.spark.sql.DataFrame = [Network: string, Station: string ... 1 more field]

scala> outputDF.show
+-------+-------+-----+
|Network|Station|count|
+-------+-------+-----+
|    XMN|    DIS|    3|
|    XMN|    CNN|    3|
|    XMN|    JFK|    3|
|    ALK|    DIS|    2|
|    ALK|    CNN|    2|
+-------+-------+-----+
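
Since the question is tagged pyspark, here is a minimal sketch of the same group-count-join approach in Python, assuming the df from the first answer; the aggregated column is aliased to Total so the result matches the desired output:

from pyspark.sql import functions as F

# Aggregate station counts per Network, then join the totals back onto the original rows
grp_count_df = df.groupBy("Network").agg(F.count("*").alias("Total"))
output_df = df.join(grp_count_df, "Network")
output_df.show()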