How to implicitly convert a column from String to Array[String] in Scala / Spark?

Time: 2019-10-30 17:22:19

Tags: arrays scala dataframe apache-spark

I have a DataFrame:

+--------------------------------------+------------------------------------------------------------+
|item                                  |item_codes                                                  |
+--------------------------------------+------------------------------------------------------------+
|loose fit long sleeve swim shirt women|["2237741011","1046622","1040660","7147440011","7141123011"]|
+--------------------------------------+------------------------------------------------------------+

The schema is as follows:

root
 |-- item: string (nullable = true)
 |-- item_codes: string (nullable = true)

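How can I convert the item_codes column from String to Array[String] in Scala?

2 answers:

Answer 0: (score: 1)

You can use regexp_replace to strip the quotes and square brackets, then split to produce an ArrayType column: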

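// In spark-shell these are already in scope; elsewhere, import them explicitly.
import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._

val df = Seq(
  ("abc", "[\"2237741011\",\"1046622\",\"1040660\",\"7147440011\",\"7141123011\"]")
).toDF("item", "item_codes")

// The regex matches each double quote plus an optional "[" before it or "]" after it,
// so quotes and brackets are removed; split then breaks the remainder at the commas.
df.
  withColumn("item_codes", split(regexp_replace($"item_codes", """\[?\"\]?""", ""), ",")).
  show(false)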
// +----+------------------------------------------------------+
// |item|item_codes                                            |
// +----+------------------------------------------------------+
// |abc |[2237741011, 1046622, 1040660, 7147440011, 7141123011]|
// +----+------------------------------------------------------+
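The regexp_replace/split approach above treats the value as plain text. Because the sample string also happens to be a valid JSON array, a parser-based variant is possible. This is a different technique from the one in this answer, shown only as a minimal sketch: it assumes every row holds a well-formed JSON array of strings and a Spark version whose from_json accepts an array-of-strings schema. The local SparkSession and the names parsed and parsedCodes are illustrative, not part of the original thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("item-codes-json").getOrCreate()
import spark.implicits._

val df = Seq(
  ("abc", """["2237741011","1046622","1040660","7147440011","7141123011"]""")
).toDF("item", "item_codes")

// Assumption: item_codes is always a valid JSON array of strings.
// from_json parses the string according to the given schema.
val parsed = df.withColumn("item_codes", from_json(col("item_codes"), ArrayType(StringType)))

parsed.printSchema()

// Pull the parsed array out of the first row to inspect it on the driver.
val parsedCodes: Seq[String] = parsed.head.getSeq[String](1)
```

A parser copes with whitespace between elements and with codes that contain commas, which a plain split on "," would not. Whether that matters depends on how the column was produced.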

Answer 1: (score: 0)

You can use the split method after a little "preprocessing":

import org.apache.spark.sql.functions.{expr, split}

val col_names = Seq("item", "item_codes")

val data = Seq(("loose fit long sleeve swim shirt women", """["2237741011","1046622","1040660","7147440011","7141123011"]"""))

val df = spark.createDataFrame(data).toDF(col_names: _*)

// Drop the first two characters ([") and the last two ("]), then split on ","
// (quote, comma, quote), the separator between consecutive codes.
df.withColumn("item_codes", split(expr("substring(item_codes, 3, length(item_codes)-4)"), """","""")).printSchema

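If the format can vary, a regular expression may be more flexible, either as leo suggested or by stripping everything that is not a digit or a comma and then splitting on the commas.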
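The "keep only digits and commas" idea can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes the codes consist solely of digits (any letters or other characters inside a code would be silently dropped), and the local SparkSession and the names digitsOnly and codes are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace, split}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("item-codes-regex").getOrCreate()
import spark.implicits._

val df = Seq(
  ("abc", """["2237741011","1046622","1040660","7147440011","7141123011"]""")
).toDF("item", "item_codes")

// Remove every character that is not a digit or a comma, then split on commas.
// Assumption: the codes themselves contain only digits.
val digitsOnly = df.withColumn(
  "item_codes",
  split(regexp_replace(col("item_codes"), "[^0-9,]", ""), ",")
)

digitsOnly.printSchema()

// Pull the resulting array out of the first row to inspect it on the driver.
val codes: Seq[String] = digitsOnly.head.getSeq[String](1)
```

Compared with the substring version, this does not depend on the value starting with exactly [" and ending with exactly "], so stray spaces or missing quotes are tolerated.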