Question

I have created a temporary DataFrame as follows:

var someDF = Seq(("1","1.2.3.4"), ("2","5.26.6.3")).toDF("s/n", "ip")

Is there a way to extract the subnet from the full IP address and put it into a new "subnet" column?

Example output:

---------------------------
|s/n | ip       | subnet  |
---------------------------
|1   | 1.2.3.4  | 1.2.3.x |
|2   | 5.26.6.3 | 5.26.6.x|
---------------------------

Answer 1

You can do this with a UDF:

import org.apache.spark.sql.functions.udf

// Drop the last octet and append ".x" in its place
val getSubnet = udf((ip: String) => ip.split("\\.").init.mkString(".") + ".x")

val df = someDF.withColumn("subnet", getSubnet($"ip"))

which gives you this DataFrame:

+---+--------+--------+
|s/n|      ip|  subnet|
+---+--------+--------+
|  1| 1.2.3.4| 1.2.3.x|
|  2|5.26.6.3|5.26.6.x|
+---+--------+--------+
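
One thing to watch with the UDF above: a null `ip` would reach `split` and throw, and a malformed value such as `"1.2.3"` would silently lose an octet. Below is a minimal plain-Scala sketch of a more defensive version of the same logic. The object and function names are hypothetical, and the "exactly four numeric parts" rule is my assumption about what counts as a valid address here.

```scala
object SubnetSketch {
  // Hypothetical helper: returns None for null or malformed input instead of throwing.
  def subnetOf(ip: String): Option[String] =
    Option(ip)                       // a null input becomes None
      .map(_.trim.split("\\.", -1))  // limit -1 keeps trailing empty parts, e.g. "1.2.3."
      .filter(parts =>
        parts.length == 4 && parts.forall(p => p.nonEmpty && p.forall(_.isDigit)))
      .map(parts => parts.init.mkString(".") + ".x") // same transformation as the UDF
}
```

With Spark on the classpath this could presumably be wrapped as `udf(SubnetSketch.subnetOf _)`; as far as I know, an `Option` result of `None` ends up as a null in the new column.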

Answer 2

You can use the built-in functions concat_ws and substring_index to do this:

import org.apache.spark.sql.functions._
someDF.withColumn("subnet", concat_ws(".", substring_index($"ip", ".", 3), lit("x")))
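
To make the two steps easier to follow: `substring_index($"ip", ".", 3)` keeps everything before the third dot, and `concat_ws` then joins that prefix and `"x"` with a dot. Here is a rough plain-Scala stand-in that mimics this without Spark. The helper names are hypothetical, and it only covers positive counts and non-empty delimiters, so treat it as an illustration of my understanding of the semantics rather than Spark's actual implementation.

```scala
import java.util.regex.Pattern

object SubstringIndexSketch {
  // Hypothetical stand-in for substring_index, positive counts only:
  // returns everything before the count-th occurrence of delim,
  // or the whole string when delim occurs fewer than count times.
  def substringIndex(s: String, delim: String, count: Int): String = {
    require(delim.nonEmpty && count > 0,
      "sketch covers non-empty delimiters and positive counts only")
    val parts = s.split(Pattern.quote(delim), -1)
    if (parts.length <= count) s else parts.take(count).mkString(delim)
  }

  // Mirrors concat_ws(".", substring_index(ip, ".", 3), lit("x"))
  def subnetOf(ip: String): String =
    Seq(substringIndex(ip, ".", 3), "x").mkString(".")
}
```

Note that a short value like `"1.2"` passes through unchanged in the first step, so it becomes `"1.2.x"` rather than raising an error.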

Answer 3

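You can try the following. The code is very simple and, since it relies only on built-in functions rather than a UDF, it may also perform better: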

import org.apache.spark.sql.functions.{ concat, lit, col, regexp_replace }

someDF.withColumn("subnet", concat(regexp_replace(col("ip"), "(.*\\.)\\d+$", "$1"), lit("x"))).show()

Output

+---+--------+--------+
|s/n|      ip|  subnet|
+---+--------+--------+
|  1| 1.2.3.4| 1.2.3.x|
|  2|5.26.6.3|5.26.6.x|
+---+--------+--------+
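
A caveat on the regex approach: when the pattern does not match (for example, the value does not end in digits), the input is left untouched and `x` is still appended. Spark's `regexp_replace` uses Java regular expressions, so the behaviour can be checked in plain Scala with `replaceAll`. The sketch below uses hypothetical names and is only meant to show that edge case.

```scala
object RegexSubnetSketch {
  // Same pattern as in the answer: capture everything up to and including
  // the last dot, then require one or more digits at the end of the string.
  val LastOctet = "(.*\\.)\\d+$"

  // Plain-Scala equivalent of concat(regexp_replace(ip, pattern, "$1"), lit("x"))
  def subnetOf(ip: String): String = ip.replaceAll(LastOctet, "$1") + "x"

  // Hypothetical guard: true only when the pattern would actually match
  def endsWithOctet(ip: String): Boolean = LastOctet.r.findFirstIn(ip).isDefined
}
```

Here `subnetOf("1.2.3.abc")` yields `"1.2.3.abcx"`, which is probably not what you want. In Spark, one option might be to guard the expression with `when(col("ip").rlike(...), ...)` so that non-matching rows get null instead.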
