I have a DataFrame:
+--------------------------------------+------------------------------------------------------------+
|item |item_codes |
+--------------------------------------+------------------------------------------------------------+
|loose fit long sleeve swim shirt women|["2237741011","1046622","1040660","7147440011","7141123011"]|
+--------------------------------------+------------------------------------------------------------+
The schema is:
root
|-- item: string (nullable = true)
|-- item_codes: string (nullable = true)
How can I convert the string column item_codes to Array[String] in Scala?
Answer 0 (score: 1)
You can use regexp_replace to strip the quotes and square brackets, then use split to produce an ArrayType column:
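import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._ // for toDF and $"..."; already in scope in spark-shell

val df = Seq(
  ("abc", "[\"2237741011\",\"1046622\",\"1040660\",\"7147440011\",\"7141123011\"]")
).toDF("item", "item_codes")

// Remove every double quote together with the bracket attached to it, then split on commas.
df.
  withColumn("item_codes", split(regexp_replace($"item_codes", """\[?\"\]?""", ""), "\\,")).
  show(false)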
// +----+------------------------------------------------------+
// |item|item_codes |
// +----+------------------------------------------------------+
// |abc |[2237741011, 1046622, 1040660, 7147440011, 7141123011]|
// +----+------------------------------------------------------+
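To see what the pattern actually removes, it can be tried on a plain Scala string outside Spark. This is only a sketch: the object `StripAndSplit` and the helper `toCodes` are hypothetical names, and it relies on Spark's regexp_replace and split using Java regular expressions, so `String.replaceAll` and `String.split` should behave the same way on this input.

```scala
object StripAndSplit {
  // Same pattern as in the answer: an optional '[', a double quote, an optional ']'.
  val pattern: String = """\[?\"\]?"""

  // Hypothetical helper: strip quotes/brackets, then split on commas.
  def toCodes(raw: String): Array[String] =
    raw.replaceAll(pattern, "").split(",")

  def main(args: Array[String]): Unit = {
    val raw = """["2237741011","1046622","1040660","7147440011","7141123011"]"""
    // prints [2237741011, 1046622, 1040660, 7147440011, 7141123011]
    println(toCodes(raw).mkString("[", ", ", "]"))
  }
}
```

Note that the pattern only removes a bracket when it sits next to a quote, so an empty list such as `[]` would pass through unchanged.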
Answer 1 (score: 0)
You can use the split method after a bit of "preprocessing":
import org.apache.spark.sql.functions.{expr, split}

val col_names = Seq("item", "item_codes")
val data = Seq(("loose fit long sleeve swim shirt women", """["2237741011","1046622","1040660","7147440011","7141123011"]"""))
val df = spark.createDataFrame(data).toDF(col_names: _*)
// Chop off the first 2 characters (`["`) and the last 2 (`"]`), then split at `","`.
// SQL substring is 1-based, so position 3 is the first character that is kept.
df.withColumn("item_codes", split(expr("substring(item_codes, 3, length(item_codes)-4)"), """","""")).printSchema
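The index arithmetic is easy to get wrong, because SQL substring counts from 1 while Scala's `String.substring` counts from 0. A plain-Scala sketch of the same chop-and-split logic (the names `SubstringSplit` and `toCodes` are hypothetical, and the input is assumed to always start with `["` and end with `"]`):

```scala
object SubstringSplit {
  // SQL:   substring(s, 3, length(s) - 4)  keeps 1-based positions 3 .. length-2
  // Scala: s.substring(2, s.length - 2)    keeps 0-based indices   2 .. length-3
  // Both drop the leading `["` and the trailing `"]`.
  def toCodes(raw: String): Array[String] = {
    require(raw.length >= 4, "expected at least [\"\"]")
    raw.substring(2, raw.length - 2).split("\",\"")
  }

  def main(args: Array[String]): Unit = {
    val raw = """["2237741011","1046622","1040660"]"""
    // prints 2237741011|1046622|1040660
    println(toCodes(raw).mkString("|"))
  }
}
```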
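If the format may vary, a regular expression is probably more flexible: as leo suggested, strip everything that is not a digit or a comma, then split at the commas.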
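A minimal sketch of that last idea, assuming the codes consist of digits only. The object `DigitsOnlySplit`, the helper `withCodeArray`, and the messy sample string are made up for illustration; the Spark calls themselves (`regexp_replace`, `split`, `col`) are the standard ones from `org.apache.spark.sql.functions`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace, split}

object DigitsOnlySplit {
  // Hypothetical helper: drop every character that is not a digit or a comma,
  // then split the remainder on commas to get an array<string> column.
  def withCodeArray(df: DataFrame): DataFrame =
    df.withColumn(
      "item_codes",
      split(regexp_replace(col("item_codes"), "[^0-9,]", ""), ",")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("digits-only-split")
      .getOrCreate()
    import spark.implicits._

    // Extra spaces and single quotes, to show the pattern tolerates format changes.
    val df = Seq(("abc", """[ "2237741011", '1046622' ,"1040660"]"""))
      .toDF("item", "item_codes")

    val out = withCodeArray(df)
    out.printSchema()   // item_codes should now be array<string>
    out.show(false)

    spark.stop()
  }
}
```

This is more tolerant of whitespace and quoting differences, but it would also silently discard letters, so it only fits if the codes are purely numeric.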