How to retrieve specific values from an array of strings in a Spark RDD

Time: 2018-11-04 17:01:39

Tags: scala apache-spark apache-spark-sql rdd

I have this data:

{CurrentDate:05.24.2008,Employeeid:90786532432,Division:TX_VG}
{Division:NW_VG,CurrentDate:01.18.2006,Employeeid:907806532432} 

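stored as an array of strings in an RDD. How can I retrieve only Employeeid and Division from this array of strings? I have two sets of strings, and the fields inside each string will never appear in the same order.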

1 answer:

Answer 0: (score: 0)

Try this:

val rdd = sc.parallelize(Seq(
  "{CurrentDate:05.24.2008,Employeeid:90786532432,Division:TX_VG}",
  "{Division:NW_VG,CurrentDate:01.18.2006,Employeeid:907806532432}"))

// Returns the text that follows `key`, up to the next "," or the closing "}",
// whichever comes first, so the order of the fields does not matter.
def valueOf(s: String, key: String): String = {
  val start = s.indexOfSlice(key) + key.length
  val comma = s.indexOf(",", start)
  val brace = s.indexOf("}", start)
  val end   = if (comma == -1) brace else comma min brace
  s.slice(start, end)
}

// One (Division, Employeeid) pair per input string
val rdd2 = rdd.map(x => (valueOf(x, "Division:"), valueOf(x, "Employeeid:")))
rdd2.collect

This returns:

res52: Array[(String, String)] = Array((TX_VG,90786532432), (NW_VG,907806532432))
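
As an alternative to index arithmetic, each record could be parsed into a key/value map and the two fields looked up by name. The following is only a sketch, not part of the original answer: the helper name parseRecord is hypothetical, and it assumes that values never contain a comma and that every field has the form key:value. It uses plain Scala collections so it can be tried without a Spark context.

```scala
// Hypothetical helper (an assumption, not from the original answer):
// turns "{A:1,B:2}" into Map("A" -> "1", "B" -> "2").
// Assumes values contain no commas and each field looks like key:value.
def parseRecord(s: String): Map[String, String] =
  s.trim.stripPrefix("{").stripSuffix("}")
    .split(",")
    .map(_.trim.split(":", 2))               // split on the first ":" only
    .collect { case Array(k, v) => k -> v }  // skip malformed fields
    .toMap

val records = Seq(
  "{CurrentDate:05.24.2008,Employeeid:90786532432,Division:TX_VG}",
  "{Division:NW_VG,CurrentDate:01.18.2006,Employeeid:907806532432}")

// Field order no longer matters because values are looked up by key.
val pairs = records.map { r =>
  val m = parseRecord(r)
  (m.getOrElse("Division", ""), m.getOrElse("Employeeid", ""))
}
```

With Spark, the same function could presumably be applied as rdd.map(parseRecord) and the fields selected afterwards. Missing keys yield an empty string here; returning an Option instead may be preferable if absent fields need to be told apart from empty ones.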