Here is the DataFrame df:
org.apache.spark.sql.DataFrame = [year_artist: string, count: bigint]
df.show()
returns:
+--------------------+-----+
| year_artist|count|
+--------------------+-----+
| 1945_Dick Haymes| 5|
|1949_Ivory Joe Hu...| 1|
| 1955_Tex Ritter| 1|
I need to split the first column into two separate parts: the year and the artist. I was considering something along the lines of Spark map dataframe using the dataframe's schema.
However, the following does not work in my implementation:
df.rdd.map(row => (row(0).getAs[String].split("_")(0), row(0).getAs[String].split("_")(1)))
Perhaps there is a way to accomplish this without converting to an RDD?
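(For reference, the call above fails because row(0) returns Any, which has no getAs method; getAs is defined on Row itself. A minimal corrected sketch of the RDD route, assuming every value contains an underscore:)

val pairs = df.rdd.map { row =>
  // getAs is called on the Row, not on the Any returned by row(0);
  // split once and index the parts instead of splitting twice
  val parts = row.getAs[String]("year_artist").split("_")
  (parts(0), parts(1))
}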
Answer 0 (score: 3)
You can use regexp_extract, for example:
import org.apache.spark.sql.functions.regexp_extract

df.select(
  // group 1 captures the four-digit year, group 2 everything after the underscore
  regexp_extract($"year_artist", "^(\\d{4})_(.*)", 1).alias("year"),
  regexp_extract($"year_artist", "^(\\d{4})_(.*)", 2).alias("artist")
)
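If the pattern does not match a row, regexp_extract yields an empty string rather than failing, so malformed values degrade gracefully.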
Or you can use split:
import org.apache.spark.sql.functions.split

df.select(
  split($"year_artist", "_")(0).alias("year"),
  split($"year_artist", "_")(1).alias("artist")
)
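Note that split($"year_artist", "_")(1) keeps only the text between the first and second underscore, so an artist name that itself contains an underscore would be truncated; the regexp_extract variant with (.*) does not have that problem. On Spark 3.0+, the three-argument split overload caps the number of pieces (a sketch, assuming that overload is available in your Spark version):

import org.apache.spark.sql.functions.split

df.select(
  // limit = 2: everything after the first underscore stays together
  split($"year_artist", "_", 2)(0).alias("year"),
  split($"year_artist", "_", 2)(1).alias("artist")
)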
Answer 1 (score: 1)
You can use split (which ends up looking very similar to the other answer).
val solution = artists.
  withColumn("nested", split($"year_artist", "_")).
  select($"nested"(0) as "year", $"nested"(1) as "artist")
scala> solution.show
+----+---------------+
|year| artist|
+----+---------------+
|1945| Dick Haymes|
|1949|Ivory Joe Hu...|
|1955| Tex Ritter|
+----+---------------+
You can do something similar with the map operator.
val solution = artists.
  select("year_artist").                              // assume you want only one column to work with
  as[String].                                         // personally I don't like Rows so make them Strings
  map { year_artist => year_artist.split("_") }.      // do the hard work using Scala
  map { case Array(year, artist) => (year, artist) }. // assume there are only two fields
  toDF("year", "artist")
scala> solution.show
+----+---------------+
|year| artist|
+----+---------------+
|1945| Dick Haymes|
|1949|Ivory Joe Hu...|
|1955| Tex Ritter|
+----+---------------+
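One caveat not covered above: the pattern case Array(year, artist) throws a MatchError on any value that does not split into exactly two parts (for example, an artist name containing an underscore). A defensive sketch that skips malformed rows instead of failing the job, assuming dropping them is acceptable for your data:

val solution = artists.
  select("year_artist").
  as[String].
  flatMap { s =>
    s.split("_", 2) match {
      case Array(year, artist) => Some((year, artist)) // well-formed "year_artist"
      case _                   => None                 // no underscore: drop the row
    }
  }.
  toDF("year", "artist")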