How many ways are there in the Spark API to add a new column to a DataFrame/RDD?

Asked: 2016-05-17 06:00:08

Tags: scala apache-spark spark-dataframe

I can only think of using withColumn():

val df2 = df.withColumn("newcolname", df("oldcolname") + 1)

But how do I generalize this to text data? For example, my DataFrame has

string values such as "This is an example of a string", and I want to extract

the first and last words into a val arraystring: Array[String] = Array(first, last)

2 answers:

Answer 0 (score: 2)

Is this what you are looking for?

val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)

import sqlContext.implicits._
import org.apache.spark.sql.functions.{udf, col}

val extractFirstWord = udf((sentence: String) => sentence.split(" ").head)
val extractLastWord = udf((sentence: String) => sentence.split(" ").reverse.head)

val sentences = sc.parallelize(Seq("This is an example", "And this is another one", "One_word", "")).toDF("sentence")
val splits = sentences
             .withColumn("first_word", extractFirstWord(col("sentence")))
             .withColumn("last_word", extractLastWord(col("sentence")))

splits.show()

The output is then:

+--------------------+----------+---------+
|            sentence|first_word|last_word|
+--------------------+----------+---------+
|  This is an example|      This|  example|
|And this is anoth...|       And|      one|
|            One_word|  One_word| One_word|
|                    |          |         |
+--------------------+----------+---------+
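As a side note, the same extraction can also be done without user-defined functions, using Spark's built-in column functions. A minimal sketch, assuming Spark 1.5+ (where `regexp_extract` is available) and the same `sentences` DataFrame as in the answer above:

```scala
import org.apache.spark.sql.functions.{col, regexp_extract}

// Extract the first and last whitespace-delimited token with the built-in
// regexp_extract function instead of a UDF. Built-in functions stay inside
// Catalyst and avoid the serialization overhead of user-defined functions.
val splitsBuiltin = sentences
  .withColumn("first_word", regexp_extract(col("sentence"), "^(\\S+)", 1))
  .withColumn("last_word", regexp_extract(col("sentence"), "(\\S+)$", 1))

splitsBuiltin.show()
```

For the empty string neither regex matches, so `regexp_extract` yields an empty string, which lines up with the UDF-based output shown above.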

Answer 1 (score: 1)