Question

I have a CSV file. When I read it into a Spark DataFrame, printSchema shows the following:

-- list_values: string (nullable = true)

The values in the list_values column look like this:

[[[167, 109, 80, ...]]]

Is it possible to convert this to an array type instead of a string?

I tried splitting it, using code I found online for a similar problem:

df_1 = df.select('list_values', split(col("list_values"), ",\s*").alias("list_values"))

But when I run the code above, the resulting array skips many of the values in the original array.

The output of the code above is:

[, 109, 80, 69, 5...

which differs from the original array (i.e. 167 is missing):

[[[167, 109, 80, ...]]]

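Since I am new to Spark, I don't know much about how this is done (in Python I could use ast.literal_eval, but Spark has no provision for this).

So let me restate the question:

How can I convert/cast an array stored as a string to an array, i.e.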

'[]' to [] conversion

Answer 1

Suppose your DataFrame is the following:

df.show()
#+----+------------------+
#|col1|              col2|
#+----+------------------+
#|   a|[[[167, 109, 80]]]|
#+----+------------------+

df.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)

You can use pyspark.sql.functions.regexp_replace to remove the leading and trailing square brackets. Once that's done, you can split the resulting string on ", ":

from pyspark.sql.functions import split, regexp_replace

df2 = df.withColumn(
    "col3",
    split(regexp_replace("col2", r"(^\[\[\[)|(\]\]\]$)", ""), ", ")
)
df2.show()

#+----+------------------+--------------+
#|col1|              col2|          col3|
#+----+------------------+--------------+
#|   a|[[[167, 109, 80]]]|[167, 109, 80]|
#+----+------------------+--------------+

df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# |    |-- element: string (containsNull = true)
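
To see why the brackets have to go first, here is a small sketch that applies the same two steps to a single value with Python's re module instead of Spark. It is only an illustration: Spark uses Java regular expressions, but this particular pattern should behave the same way in both, and the sample string is the one from the example above.

```python
import re

raw = "[[[167, 109, 80]]]"

# Splitting alone leaves the brackets glued to the first and last elements.
naive = re.split(r",\s*", raw)

# Stripping the leading "[[[" and trailing "]]]" first gives clean tokens.
stripped = re.sub(r"(^\[\[\[)|(\]\]\]$)", "", raw)
tokens = stripped.split(", ")

print(naive)   # ['[[[167', '109', '80]]]']
print(tokens)  # ['167', '109', '80']
```

The elements are still strings at this point, which matches the array of strings shown in the schema above.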

If you want the column to be an array of integers, you can use cast:

from pyspark.sql.functions import col
df2 = df2.withColumn("col3", col("col3").cast("array<int>"))
df2.printSchema()
#root
# |-- col1: string (nullable = true)
# |-- col2: string (nullable = true)
# |-- col3: array (nullable = true)
# |    |-- element: integer (containsNull = true)
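
The approach above flattens the value into a single array. If the nesting should be kept, one possible alternative (not part of the answer above) is to parse the string in a Python UDF. The sketch below assumes the strings are valid JSON, which holds for values like '[[[167, 109, 80]]]'. The names parse_nested_ints and with_parsed_column are hypothetical helpers made up for this illustration; pyspark is imported inside the function only so the parser can be tried without Spark installed.

```python
import json
from typing import List, Optional


def parse_nested_ints(text: Optional[str]) -> Optional[List[List[List[int]]]]:
    """Parse a string such as '[[[167, 109, 80]]]' into nested Python lists.

    Returns None for null input or for text that is not valid JSON.
    """
    if text is None:
        return None
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


def with_parsed_column(df, source="col2", target="col3"):
    """Hypothetical helper: add `target` as array<array<array<int>>>."""
    # Imported lazily (assumption: pyspark is available where this is called).
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, IntegerType

    nested_int_array = ArrayType(ArrayType(ArrayType(IntegerType())))
    parse_udf = udf(parse_nested_ints, nested_int_array)
    return df.withColumn(target, parse_udf(df[source]))
```

Calling with_parsed_column(df) on the DataFrame above should give a col3 of three nested arrays. Python UDFs are usually slower than built-in functions, so on recent Spark versions it may be worth checking whether pyspark.sql.functions.from_json with an array schema does the same job, and pyspark.sql.functions.flatten (Spark 2.4+) can remove a level of nesting afterwards if a flatter array is wanted.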
