I'm using Scala with Spark, and I want to update one column value in an RDD. My data looks like this:
[510116,8042,1,8298,20170907181326,1,3,lineno805]
[510116,8042,1,8152,20170907182101,1,3,lineno805]
[510116,8042,1,8154,20170907164311,1,3,lineno805]
[510116,8042,1,8069,20170907165031,1,3,lineno805]
[510116,8042,1,8061,20170907170254,1,3,lineno805]
[510116,8042,1,9906,20170907171417,1,3,lineno805]
[510116,8042,1,8295,20170907174734,1,3,lineno805]
My Scala code is:
val getSerialRdd: RDD[Row]=……
I want to update the column containing values like 20170907181326 so that the data matches the following format:
[510116,8042,1,8298,2017090718,1,3,lineno805]
[510116,8042,1,8152,2017090718,1,3,lineno805]
[510116,8042,1,8154,2017090716,1,3,lineno805]
[510116,8042,1,8069,2017090716,1,3,lineno805]
[510116,8042,1,8061,2017090717,1,3,lineno805]
[510116,8042,1,9906,2017090717,1,3,lineno805]
[510116,8042,1,8295,2017090717,1,3,lineno805]
and get the result back as an RDD, i.e. RDD[Row].
How can I do this?
Answer 0 (score: 2)
You can define an update method like this to change the field in each row:
import org.apache.spark.sql.Row
def update(r: Row): Row = {
  val s = r.toSeq
  // keep the first four fields, truncate the timestamp to 10 characters, keep the rest
  Row.fromSeq((s.take(4) :+ s(4).asInstanceOf[String].take(10)) ++ s.drop(5))
}
rdd.map(update(_)).collect
//res13: Array[org.apache.spark.sql.Row] =
// Array([510116,8042,1,8298,2017090718,1,3,lineno805],
// [510116,8042,1,8152,2017090718,1,3,lineno805],
// [510116,8042,1,8154,2017090716,1,3,lineno805],
// [510116,8042,1,8069,2017090716,1,3,lineno805],
// [510116,8042,1,8061,2017090717,1,3,lineno805],
// [510116,8042,1,9906,2017090717,1,3,lineno805],
// [510116,8042,1,8295,2017090717,1,3,lineno805])
A simpler way is to use the DataFrame API and the substring function:
1) Create a DataFrame from the rdd:
val df = spark.createDataFrame(rdd, rdd.take(1)(0).schema)
// df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 6 more fields]
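Note that rdd.take(1)(0).schema only works if the rows already carry a schema (for example, if the RDD originally came from a DataFrame). If they are plain Rows, you can supply a schema explicitly instead; a minimal sketch, assuming all eight columns are strings and using the hypothetical names _c0 through _c7:
import org.apache.spark.sql.types.{StringType, StructField, StructType}
// assumed schema: eight string columns named _c0 .. _c7 (names are hypothetical)
val schema = StructType((0 to 7).map(i => StructField(s"_c$i", StringType)))
val df = spark.createDataFrame(getSerialRdd, schema)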
2) Transform the column with substring:
import org.apache.spark.sql.functions.substring
import spark.implicits._  // for the $"..." column syntax
df.withColumn("_c4", substring($"_c4", 0, 10)).show
+------+----+---+----+----------+---+---+---------+
| _c0| _c1|_c2| _c3| _c4|_c5|_c6| _c7|
+------+----+---+----+----------+---+---+---------+
|510116|8042| 1|8298|2017090718| 1| 3|lineno805|
|510116|8042| 1|8152|2017090718| 1| 3|lineno805|
|510116|8042| 1|8154|2017090716| 1| 3|lineno805|
|510116|8042| 1|8069|2017090716| 1| 3|lineno805|
|510116|8042| 1|8061|2017090717| 1| 3|lineno805|
|510116|8042| 1|9906|2017090717| 1| 3|lineno805|
|510116|8042| 1|8295|2017090717| 1| 3|lineno805|
+------+----+---+----+----------+---+---+---------+
3) Converting the DataFrame back to an RDD is easy:
val getSerialRdd = df.withColumn("_c4", substring($"_c4", 0, 10)).rdd
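As a quick check, printing a few rows of the resulting RDD[Row] should show the timestamps truncated to the requested format:
getSerialRdd.take(3).foreach(println)
// e.g. [510116,8042,1,8298,2017090718,1,3,lineno805]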
Answer 1 (score: 0)
In some cases you may want to update a row while preserving its schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
def update(r: Row, i: Int, a: Any): Row = {
  // replace the value at index i and rebuild the row with its original schema
  val s: Array[Any] = r.toSeq.toArray.updated(i, a)
  new GenericRowWithSchema(s, r.schema)
}
rdd.map(r => update(r, 4, r.getAs[String](4).take(10))).collect
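Because GenericRowWithSchema keeps the original schema, the updated rows can still be read by field name; a small sketch, assuming the timestamp column is named _c4 as in the DataFrame example above:
val updated = rdd.map(r => update(r, 4, r.getAs[String](4).take(10)))
// field access by name still works on the updated rows
updated.take(1).foreach(r => println(r.getAs[String]("_c4")))  // e.g. 2017090718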