Question

How can I convert a column that was read as a string into an array column? That is, convert from the schema below:

scala> test.printSchema
root
 |-- a: long (nullable = true)
 |-- b: string (nullable = true)

+---+---+
|  a|  b|
+---+---+
|  1|2,3|
+---+---+
|  2|4,5|
+---+---+

to:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|  b  |
+---+-----+
|  1|[2,3]|
+---+-----+
|  2|[4,5]|
+---+-----+

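If possible, please share both Scala and Python implementations. On a related note, how can I handle this while reading from the file itself? My data has about 450 columns, and only a few of them need to be specified in this format. Currently I am reading it in pyspark as follows: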

df = spark.read.format('com.databricks.spark.csv').options(
    header='true', inferschema='true', delimiter='|').load(input_file)

Thanks.

Answer 1

There are several ways to do this.

The best is to use the split function and cast the result to array<long>:

import org.apache.spark.sql.functions.split
data.withColumn("b", split(data("b"), ",").cast("array<long>"))

You can also create a simple UDF to convert the values:

import org.apache.spark.sql.functions.udf
val toLong = udf((value: String) => value.split(",").map(_.toLong))

data.withColumn("newB", toLong(data("b"))).show

Hope this helps!

Answer 2

Using a UDF will give you exactly the required schema. Like this:

import org.apache.spark.sql.functions.{col, udf}
val toArray = udf((b: String) => b.split(",").map(_.toLong))

val test1 = test.withColumn("b", toArray(col("b")))

It will give you the following schema:

scala> test1.printSchema
root
 |-- a: long (nullable = true)
 |-- b: array (nullable = true)
 |    |-- element: long (containsNull = true)

+---+-----+
|  a|  b  |
+---+-----+
|  1|[2,3]|
+---+-----+
|  2|[4,5]|
+---+-----+

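As for applying the schema while reading the file itself, I think that is a difficult task. So for now, you can apply the transformation after creating the DataFrameReader's {{1}}.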
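For the case in the question, where only a few of roughly 450 columns need converting, one possible approach is to read the file as usual and then build a list of select expressions that cast just the chosen columns. The sketch below is in Python and only illustrates the idea. The names `array_cast_exprs` and `load_with_arrays` are hypothetical helpers, not part of Spark, and the sketch assumes column names contain no backticks and that the array separator is a plain comma.

```python
def array_cast_exprs(all_columns, array_columns, element_type="bigint", sep=","):
    """Build Spark SQL select expressions (hypothetical helper).

    Columns listed in array_columns are split on `sep` and cast to
    array<element_type>; every other column is passed through unchanged.
    """
    exprs = []
    for name in all_columns:
        if name in array_columns:
            exprs.append(
                f"cast(split(`{name}`, '{sep}') as array<{element_type}>) as `{name}`"
            )
        else:
            exprs.append(f"`{name}`")
    return exprs


def load_with_arrays(spark, input_file, array_columns):
    """Read a pipe-delimited CSV, then convert the chosen columns (sketch)."""
    df = spark.read.csv(input_file, header=True, inferSchema=True, sep="|")
    return df.selectExpr(*array_cast_exprs(df.columns, set(array_columns)))
```

With this, something like `load_with_arrays(spark, input_file, ["b"])` would leave the other columns as inferred and convert only `b`. Note that `sep` is interpreted by `split` as a regular expression, so a separator with special characters would need escaping.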

I hope this helps!

Answer 3

In Python (pyspark) it would be:

from pyspark.sql.functions import col, split

# Split on commas (allowing trailing whitespace) and cast the elements to long
test = test.withColumn(
    "b",
    split(col("b"), r",\s*").cast("array<long>")
)
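The UDF approach from the earlier Scala answers could be sketched in Python roughly as follows. This is only an illustration: `parse_long_list` and `with_array_column` are hypothetical names, and the handling of nulls and empty pieces is an assumption about what is wanted, not something stated in the question. The built-in `split` plus `cast` is usually preferable, since Python UDFs add serialization overhead.

```python
from typing import List, Optional


def parse_long_list(value: Optional[str]) -> Optional[List[int]]:
    """Turn '2,3' into [2, 3]; a null input stays null (assumed behavior)."""
    if value is None:
        return None
    # Skip empty pieces so '' or a trailing comma does not raise an error
    return [int(part) for part in value.split(",") if part.strip()]


def with_array_column(df, column: str):
    """Replace `column` with its parsed array<long> version (sketch)."""
    # Imported lazily so parse_long_list can be used without Spark installed
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, LongType

    to_array = udf(parse_long_list, ArrayType(LongType()))
    return df.withColumn(column, to_array(col(column)))
```

Calling `with_array_column(test, "b")` would then return a DataFrame whose `b` column is an array of longs.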

Spark: converting a string column to an array
