I have a DataFrame containing 3 networks, each with a number of stations. What I want is the total number of stations for each network. The DataFrame should still contain the network and station names, so it should look like this:
Network  Station  Total
XMN      DIS      3
XMN      CNN      3
XMN      JFK      3
ALK      DIS      2
ALK      CNN      2
How would I go about doing this? I assume I need to group the columns and then use a window function partitioned by network and station to get the totals? I'm not sure, but how would I do that?
Answer 0 (score: 1)
Window.partitionBy does exactly that:
from pyspark.sql import Row, Window
from pyspark.sql.functions import count

# spark_session is assumed to be an existing SparkSession
df = spark_session.createDataFrame([
    Row(Network='XMN', Station='DIS'),
    Row(Network='XMN', Station='CNN'),
    Row(Network='XMN', Station='JFK'),
    Row(Network='ALK', Station='DIS'),
    Row(Network='ALK', Station='CNN')
])

# Count rows per Network without collapsing the Station column
df.select("Network", "Station", count("*").over(Window.partitionBy("Network")).alias("Total")).show()
Output:
+-------+-------+-----+
|Network|Station|Total|
+-------+-------+-----+
| XMN| DIS| 3|
| XMN| CNN| 3|
| XMN| JFK| 3|
| ALK| DIS| 2|
| ALK| CNN| 2|
+-------+-------+-----+
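If you prefer SQL syntax, the same window aggregation can be written as a query against a temporary view. This is a minimal sketch, assuming the df and spark_session objects defined above (the view name "stations" is arbitrary):

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("stations")

# count(*) OVER (PARTITION BY Network) keeps every row and adds the per-network total
spark_session.sql("""
    SELECT Network, Station,
           count(*) OVER (PARTITION BY Network) AS Total
    FROM stations
""").show()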
Answer 1 (score: 0)
You need to group by Network, get the count, and join it back to the original DataFrame:
scala> val df = Seq(("XMN", "DIS"), ("XMN", "CNN"), ("XMN", "JFK"), ("ALK", "DIS"), ("ALK", "CNN")).toDF("Network", "Station")
df: org.apache.spark.sql.DataFrame = [Network: string, Station: string]
scala> df.show
+-------+-------+
|Network|Station|
+-------+-------+
| XMN| DIS|
| XMN| CNN|
| XMN| JFK|
| ALK| DIS|
| ALK| CNN|
+-------+-------+
scala> val grpCountDF = df.groupBy("Network").count
grpCountDF: org.apache.spark.sql.DataFrame = [Network: string, count: bigint]
scala> grpCountDF.show
+-------+-----+
|Network|count|
+-------+-----+
| XMN| 3|
| ALK| 2|
+-------+-----+
scala> val outputDF = df.join(grpCountDF, "Network")
outputDF: org.apache.spark.sql.DataFrame = [Network: string, Station: string ... 1 more field]
scala> outputDF.show
+-------+-------+-----+
|Network|Station|count|
+-------+-------+-----+
| XMN| DIS| 3|
| XMN| CNN| 3|
| XMN| JFK| 3|
| ALK| DIS| 2|
| ALK| CNN| 2|
+-------+-------+-----+
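Note that this approach leaves the aggregated column named count rather than Total as in the desired output. A minimal PySpark sketch of the same group-and-join idea, assuming the df from the first answer, renames that column with withColumnRenamed:

# Count stations per network and rename the resulting column to Total
grp_count_df = df.groupBy("Network").count().withColumnRenamed("count", "Total")

# Join the per-network totals back onto the original rows
output_df = df.join(grp_count_df, "Network")
output_df.show()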