Question

Referring to "How do I select item with most count in a dataframe and define is as a variable in scala?"

Given the table below, how can I select the nth src_ip and store it in a variable?

+--------------+------------+
|        src_ip|src_ip_count|
+--------------+------------+
|  58.242.83.11|          52|
|58.218.198.160|          33|
|58.218.198.175|          22|
|221.194.47.221|           6|
+--------------+------------+

Answer 1

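You can create another column holding the row number:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
// Tag each row with an increasing (but not consecutive) id, then renumber as 1, 2, 3, ...
val tempdf = df
  .withColumn("row_number", monotonically_increasing_id())
  .withColumn("row_number", row_number().over(Window.orderBy("row_number")))

This should give you tempdf as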

+--------------+------------+----------+
|        src_ip|src_ip_count|row_number|
+--------------+------------+----------+
|  58.242.83.11|          52|         1|
|58.218.198.160|          33|         2|
|58.218.198.175|          22|         3|
|221.194.47.221|           6|         4|
+--------------+------------+----------+
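One caveat about this numbering: monotonically_increasing_id() only guarantees increasing, unique ids, so the result follows whatever order the dataframe already has. If the intent is "nth highest count", a minimal sketch of an alternative is to order the window by src_ip_count directly. The local SparkSession named spark, the appName, and the sample rows (copied from the question's table) are assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumption: a local SparkSession, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("nth-src-ip").getOrCreate()
import spark.implicits._

// Sample data copied from the question's table.
val df = Seq(
  ("58.242.83.11", 52L),
  ("58.218.198.160", 33L),
  ("58.218.198.175", 22L),
  ("221.194.47.221", 6L)
).toDF("src_ip", "src_ip_count")

// Number the rows explicitly by count, highest first.
// Note: a window without partitionBy moves all rows into a single partition.
val ranked = df.withColumn(
  "row_number",
  row_number().over(Window.orderBy(col("src_ip_count").desc))
)
```

With this ordering, ties between equal counts are broken arbitrarily, so add a second sort column if that matters.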

Now you can select the nth row by filtering on row_number:

  .filter($"row_number" === n)

That should be it.

To extract the IP, suppose your n is 2

val n = 2

Then the process above will give you

+--------------+------------+----------+
|        src_ip|src_ip_count|row_number|
+--------------+------------+----------+
|58.218.198.160|          33|         2|
+--------------+------------+----------+

How to get the IP address out of that row is explained in the link you provided in the question:

.head.get(0)
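Since .head.get(0) returns an untyped value and head fails when no row matches, here is a hedged sketch of storing the result as a typed variable. The SparkSession setup, the stand-in tempdf, and the name nthIp are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local SparkSession, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("extract-src-ip").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the numbered dataframe shown above.
val tempdf = Seq(
  ("58.242.83.11", 52L, 1),
  ("58.218.198.160", 33L, 2),
  ("58.218.198.175", 22L, 3),
  ("221.194.47.221", 6L, 4)
).toDF("src_ip", "src_ip_count", "row_number")

val n = 2

// take(1).headOption yields None instead of throwing when n is out of range.
val nthIp: Option[String] = tempdf
  .filter(col("row_number") === n)
  .select("src_ip")
  .as[String]
  .take(1)
  .headOption
```

Calling nthIp.getOrElse(...) or pattern matching on the Option then gives a plain String to use elsewhere.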

The safest way is to convert the dataframe to an rdd, use zipWithIndex on it, and then convert back to a dataframe, so that we have an unambiguous row_number column.
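A minimal sketch of the zipWithIndex approach, under stated assumptions: the local SparkSession, the sample rows, and the names indexedRows, schema, and numbered are illustrative and not taken from the original answer.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Assumption: a local SparkSession, for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("zip-with-index").getOrCreate()
import spark.implicits._

// Sample data copied from the question's table.
val df = Seq(
  ("58.242.83.11", 52L),
  ("58.218.198.160", 33L),
  ("58.218.198.175", 22L),
  ("221.194.47.221", 6L)
).toDF("src_ip", "src_ip_count")

// zipWithIndex assigns consecutive 0-based indices in the rdd's order;
// add 1 so that numbering starts at 1.
val indexedRows = df.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx + 1))
}

// Extend the original schema with the new row_number column.
val schema = StructType(df.schema.fields :+ StructField("row_number", LongType, nullable = false))

// Convert back to a dataframe.
val numbered = spark.createDataFrame(indexedRows, schema)
```

Unlike the window-based numbering, this avoids pulling every row into a single partition, at the cost of a round trip through the rdd API.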

The remaining steps are the same as explained above.
