Question

I'm running into a problem with Spark. I have a DataFrame in which one column contains an array holding a start value and an end value, for example:

[1000, 1010]

How can I create and compute another column containing an array of all the values in the given range? The resulting range column would look like this:

    +--------------+-------------+-----------------------------+
    |   Description|     Accounts|                        Range|
    +--------------+-------------+-----------------------------+
    |       Range 1|   [101, 105]|    [101, 102, 103, 104, 105]|
    |       Range 2|   [200, 203]|         [200, 201, 202, 203]|
    +--------------+-------------+-----------------------------+

Thanks in advance.

Answer 1

You need to create a UDF for this.

df.show
+-----------+----------+
|Description|  Accounts|
+-----------+----------+
|    Range 1|[100, 105]|
|    Range 2|[200, 203]|
+-----------+----------+

I've tried to cover a few possible edge cases here. If you find any that are missing, you can add more.

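import org.apache.spark.sql.functions.udf

val createRange = udf { (xs: Seq[Int]) =>
  if (xs.length == 0) Array[Int]()               // empty input: empty range
  else if (xs.length == 1) (0 to xs(0)).toArray  // single value: range from 0 up to it
  else (xs(0) to xs(1)).toArray                  // inclusive range from start to end
}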

Call this UDF, createRange, on your DataFrame and pass it the Accounts array:

df.withColumn("Range" , createRange($"Accounts") ).show(false)
+-----------+----------+------------------------------+
|Description|Accounts  |Range                         |
+-----------+----------+------------------------------+
|Range 1    |[100, 105]|[100, 101, 102, 103, 104, 105]|
|Range 2    |[200, 203]|[200, 201, 202, 203]          |
+-----------+----------+------------------------------+
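The edge cases handled by the UDF above can also be checked without a Spark session by pulling the logic into a plain function. This is only a sketch: `expandRange` is a hypothetical name, and treating a null array as an empty range is an assumption that the original UDF does not make.

```scala
// Hypothetical helper: the same range logic as createRange, as a plain Scala function.
def expandRange(xs: Seq[Int]): Seq[Int] =
  if (xs == null || xs.isEmpty) Seq.empty[Int] // assumption: null or empty input gives an empty range
  else if (xs.length == 1) (0 to xs(0)).toList // single value: count up from 0, as in the UDF above
  else (xs(0) to xs(1)).toList                 // inclusive range; empty when start > end

// Print a few cases, including a reversed range.
println(expandRange(Seq(101, 105)))
println(expandRange(Seq(3)))
println(expandRange(Seq(5, 1)))
```

Note that Scala's `to` with the default step yields an empty range when the start is greater than the end, so a reversed pair such as [5, 1] produces an empty array rather than a descending one.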

Answer 2

In Spark 2.4 you can use the sequence function. Suppose you have this DataFrame:

df.show()
+--------+
|column_1|
+--------+
|       1|
|       2|
|       3|
|       0|
+--------+

If you use the sequence function from 0 to column_1, you get the following:

df.withColumn("range", sequence(lit(0), col("column_1"))).show()
+--------+------------+
|column_1|       range|
+--------+------------+
|       1|      [0, 1]|
|       2|   [0, 1, 2]|
|       3|[0, 1, 2, 3]|
|       0|         [0]|
+--------+------------+

In your case, use the elements of the Accounts column as the arguments:

df.withColumn("Range", sequence(col("Accounts")(0), col("Accounts")(1))).show()
 +--------------+-------------+-----------------------------+
 |   Description|     Accounts|                        Range|
 +--------------+-------------+-----------------------------+
 |       Range 1|   [101, 105]|    [101, 102, 103, 104, 105]|
 |       Range 2|   [200, 203]|         [200, 201, 202, 203]|
 +--------------+-------------+-----------------------------+
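For anyone who wants to try the sequence approach end to end, here is a minimal sketch that builds the sample data from the question and applies it. The local SparkSession settings, the app name, and the `withRange` variable name are assumptions made for illustration; it needs Spark 2.4 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sequence}

// Assumption: a local session just for trying this out.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("range-example")
  .getOrCreate()
import spark.implicits._

// Sample data matching the question: each Accounts array is [start, end].
val df = Seq(
  ("Range 1", Seq(101, 105)),
  ("Range 2", Seq(200, 203))
).toDF("Description", "Accounts")

// Index 0 of the array is the start, index 1 is the end (both inclusive).
val withRange = df.withColumn(
  "Range",
  sequence(col("Accounts")(0), col("Accounts")(1))
)

withRange.show(false)
```

Because sequence is a built-in function, this avoids the UDF entirely, which is usually preferable when the Spark version allows it.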

How to create a column with all the values in a range specified by another column
