Question

I have a DataFrame:

+--------------------------------------+------------------------------------------------------------+
|item                                  |item_codes                                                  |
+--------------------------------------+------------------------------------------------------------+
|loose fit long sleeve swim shirt women|["2237741011","1046622","1040660","7147440011","7141123011"]|
+--------------------------------------+------------------------------------------------------------+

with the following schema:

root
 |-- item: string (nullable = true)
 |-- item_codes: string (nullable = true)

How can I convert the string column item_codes to Array[String] in Scala?

Answer 1

You can use regexp_replace to strip the quotes and square brackets, then use split to produce an ArrayType column:

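import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._  // for toDF and the $"..." column syntax

val df = Seq(
  ("abc", "[\"2237741011\",\"1046622\",\"1040660\",\"7147440011\",\"7141123011\"]")
).toDF("item", "item_codes")

// Strip each double quote (plus an adjacent bracket, if any), then split on commas
df.
  withColumn("item_codes", split(regexp_replace($"item_codes", """\[?\"\]?""", ""), "\\,")).
  show(false)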
// +----+------------------------------------------------------+
// |item|item_codes                                            |
// +----+------------------------------------------------------+
// |abc |[2237741011, 1046622, 1040660, 7147440011, 7141123011]|
// +----+------------------------------------------------------+
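
To see what the pattern in this answer actually matches, the same regular expression can be tried on a plain Scala string, without a Spark session. This is only a sketch: raw, stripped, and codes are illustrative names, and it relies on Spark's regexp_replace and split using Java regular expressions, just as String.replaceAll and String.split do.

```scala
// Same input string as in the question
val raw = """["2237741011","1046622","1040660","7147440011","7141123011"]"""

// \[?\"\]? matches one double quote, together with a '[' directly before it
// or a ']' directly after it when one is present
val stripped = raw.replaceAll("""\[?\"\]?""", "")

// Only digits and commas remain, so splitting on a literal comma yields the codes
val codes = stripped.split("\\,").toList

println(stripped) // 2237741011,1046622,1040660,7147440011,7141123011
println(codes)
```

Note that the pattern only removes a bracket that sits next to a quote, so an empty list such as [] would pass through unchanged.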

Answer 2

You can use the split method after a bit of "preprocessing":

import org.apache.spark.sql.functions.{expr, split}

val col_names = Seq("item", "item_codes")

val data = Seq(("loose fit long sleeve swim shirt women", """["2237741011","1046622","1040660","7147440011","7141123011"]"""))

val df = spark.createDataFrame(data).toDF(col_names: _*)

// drop the first 2 and last 2 characters ([" and "]), then split at ","
df.withColumn("item_codes", split(expr("substring(item_codes, 3, length(item_codes)-4)"), """","""")).printSchema

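If the format can change, a regular expression is probably more flexible: as leo suggests, strip everything that is not a digit or a comma, and then split at the commas.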
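
A minimal sketch of that regex variant, assuming every code consists of digits only (any other character inside a code would be stripped as well). The local SparkSession, the app name, and the names parsed and codes are illustrative assumptions; in spark-shell, spark already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace, split}

// Assumption: a local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("item-codes").getOrCreate()
import spark.implicits._

val df = Seq(
  ("abc", """["2237741011","1046622","1040660","7147440011","7141123011"]""")
).toDF("item", "item_codes")

// Remove every character that is neither a digit nor a comma, then split at the commas
val parsed = df.withColumn(
  "item_codes",
  split(regexp_replace(col("item_codes"), "[^0-9,]", ""), ",")
)

parsed.printSchema()

// Pull the array out of the first row to inspect it
val codes = parsed.first().getSeq[String](1)
println(codes.mkString(", "))
```

Unlike the substring approach, this does not depend on the exact position of the brackets and quotes, so stray whitespace between the elements is tolerated too.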
