How do I map an RDD of type org.apache.spark.rdd.RDD[Array[String]]?

Date: 2015-09-15 14:45:26

Tags: scala mapreduce apache-spark

I am new to Spark and Scala. I have an RDD of type org.apache.spark.rdd.RDD[Array[String]].

Here is the output of myRdd.take(3):

Array(Array(1, 2524474, CBSGPRS, 1, 2015-09-09 10:42:03, 0, 47880, 302001131103734, NAT, "", 502161081073570, "", BLANK, UNK, "", "", "", MV_PVC, BLANK, 1, "", 0, 475078439, 41131;0;0, "", 102651;0;0, 3|3), Array(2, 2524516, CBSGPRS, 1, 2015-09-09 23:42:14, 0, 1260, 302001131104272, NAT, "", 502161081074085, "", BLANK, UNK, "", "", "", MV_PVC, BLANK, 1, "", 0, 2044745984, 3652;0;0, "", 8636;0;0, 3|3), Array(3, 2524545, CBSGPRS, 1, 2015-09-09 14:56:55, 0, 32886, 302001131101629, NAT, "", 502161081071599, "", BLANK, UNK, "", "", "", MV_PVC, BLANK, 1, "", 0, 1956194307, 14164657;0;0, "", 18231194;0;0, 3|3))

I am trying to map it as follows:

var gprsMap = frows.collect().map{ tuple =>
// bind variables to the tuple
var (recKey, origRecKey, recTypeId, durSpanId, timestamp, prevConvDur, convDur,
    msisdn, callType, aPtyCellId, aPtyImsi, aPtyMsrn, bPtyNbr, bPtyNbrTypeId,
    bPtyCellId, bPtyImsi, bPtyMsrn, inTrgId, outTrgId, callStatusId, suppSvcId, provChgAmt,
    genFld1, genFld2, genFld3, genFld4, genFld5) = tuple

var dtm = timestamp.split(" ");
var idx = timestamp indexOf ' '
var dt = timestamp slice(0, idx)
var tm = timestamp slice(idx + 1, timestamp.length)

// return the results tuple
((dtm(0), msisdn, callType, recTypeId, provChgAmt), (convDur))
}

I keep getting this error:

error: object Tuple27 is not a member of package scala

I am not sure what the error means. Can someone help?

1 Answer:

Answer 0 (score: 3)

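The problem is that Scala only supports tuples with up to 22 fields, so there is no Tuple27 for a 27-name pattern to match against. In addition, your frows: RDD[Array[String]] contains Array[String] elements, so the tuple variable in the map function is also of type Array[String]. An Array[String] cannot be unapplied (destructured) into a tuple.

What you can do instead is access the elements of the array directly by index.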

val recKey = tuple(0)
val timestamp = tuple(4)
...
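
To round out the index-based approach, here is a minimal sketch of the whole mapping written as a plain function over Array[String]. The object name GprsMapping, the function name toKeyValue, and the Key type alias are hypothetical. The field positions (2, 4, 6, 7, 8, 21) are an assumption taken from the order of the variable names in the question. The example runs on a single sample row without Spark so that it stays self-contained.

```scala
object GprsMapping {
  // Key: (date, msisdn, callType, recTypeId, provChgAmt); value: convDur.
  type Key = (String, String, String, String, String)

  // Index positions are assumed from the order of names in the question:
  // recTypeId = 2, timestamp = 4, convDur = 6, msisdn = 7,
  // callType = 8, provChgAmt = 21.
  def toKeyValue(fields: Array[String]): (Key, String) = {
    val recTypeId  = fields(2)
    val timestamp  = fields(4)
    val convDur    = fields(6)
    val msisdn     = fields(7)
    val callType   = fields(8)
    val provChgAmt = fields(21)

    // Keep only the date part of "yyyy-MM-dd HH:mm:ss".
    val date = timestamp.split(" ")(0)

    ((date, msisdn, callType, recTypeId, provChgAmt), convDur)
  }

  def main(args: Array[String]): Unit = {
    // First row from the myRdd.take(3) output in the question.
    val row = Array(
      "1", "2524474", "CBSGPRS", "1", "2015-09-09 10:42:03", "0", "47880",
      "302001131103734", "NAT", "", "502161081073570", "", "BLANK", "UNK",
      "", "", "", "MV_PVC", "BLANK", "1", "", "0", "475078439",
      "41131;0;0", "", "102651;0;0", "3|3")

    println(toKeyValue(row))
  }
}
```

With Spark, the same function could presumably be passed straight to the RDD, for example frows.map(GprsMapping.toKeyValue). That would keep the work distributed instead of pulling every row to the driver with collect() first. The result is a 2-tuple whose key is a 5-tuple, so it stays well under the 22-field limit.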