Extracting a string from an RDD

Time: 2016-10-05 16:31:31

Tags: apache-spark

I have a DataFrame whose rows each contain two Doubles. I want to produce a formatted String that represents the DataFrame as a JSON-style list. Here is my code:

df.rdd.take(5)
val values = df.rdd.map{ case Row(x :Double ,y: Double) => (x,y,50) }
The take(5) on the RDD looks like this:

Array[org.apache.spark.sql.Row] = Array([41.64068433800631,37.689287325884315], [37.01941012184662,30.390807326639077], [34.02364443854447,40.55991398223156], [41.52505975127479,42.02651332703204], [39.33233947587333,33.62091706778894]) 

I want a string that looks like this:

"[[41.64068433800631,37.689287325884315, 50], [37.01941012184662,30.390807326639077, 50], [34.02364443854447,40.55991398223156, 50], [41.52505975127479,42.02651332703204, 50], [39.33233947587333,33.62091706778894, 50]]"

I tried a sequential approach to building the string, but I get a strange error:

val valuesCol = values.collect()

var a = "["

for( a <- 1 to valuesCol.length){
    a = a + "[" + valuesCol(1)._1+ "," + valuesCol(1)._2 + "," + valuesCol(1)._3 + "]"
}
a =  a + "]"

println(a)

The error is:

error: reassignment to val

As you can see, a is a var. I don't understand what the problem is. Any fix for this error, or any other approach, would be appreciated.

1 answer:

Answer 0: (score: 1)

The error happens because the generator in for (a <- 1 to valuesCol.length) introduces a new val named a that shadows your outer var a, so a = a + ... inside the loop is a reassignment to that val. (Your loop also always indexes valuesCol(1) instead of using the loop variable.) You don't need the mutable loop at all; map and mkString do this cleanly:

val data = Array((1,2,1),(1,2,11),(23,8,1))
val rdd = sc.parallelize(data)
val res ="["+  rdd.map{ case(x,y,z) => "["+ x + "," + y + "," + z + "]" }.collect.mkString(",") + "]"

Output:

res: String = [[1,2,1],[1,2,11],[23,8,1]]
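The same mkString pattern applies directly to the asker's data. Here is a minimal pure-Scala sketch (no Spark needed once the rows are collected), assuming the pairs are Doubles and the constant 50 is appended to each; the sample values are taken from the question:

```scala
// Minimal sketch without Spark: format collected (Double, Double) pairs
// as a JSON-style list, appending the constant 50 to each pair.
val pairs = Array((41.64068433800631, 37.689287325884315),
                  (37.01941012184662, 30.390807326639077))
val json = "[" + pairs.map { case (x, y) => s"[$x,$y,50]" }.mkString(",") + "]"
println(json)
// [[41.64068433800631,37.689287325884315,50],[37.01941012184662,30.390807326639077,50]]
```

On the original DataFrame the formatting can even happen before collecting, e.g. df.rdd.map { case Row(x: Double, y: Double) => s"[$x,$y,50]" }.collect().mkString(","), so only the small formatted strings travel to the driver.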