How to identify empty fields in a csv file?

Asked: 2017-06-30 06:40:57

Tags: arrays scala csv apache-spark mapping

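I am using Spark 2.1.1 and Scala 2.11.8.

I have to read data from a csv file where the number of columns varies from line to line. Each line splits into at most 9 entries (columns 0 to 8): columns 0 to 5 always have data, while columns 6 to 8 may or may not be present. I separate the required columns and store them in an RDD as follows: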

val read_file = sc.textFile("Path to input file");

val uid = read_file.map(line => {var arr = line.split(","); (arr(2).split(":")(0),arr(3),arr(4).split(":")(0),arr(5).split(":")(0),arr(6).split(":")(0),arr(7).split(":")(0),arr(8).split(":")(0))})

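Now, in the resulting RDD "uid", columns 0 to 3 will always be populated, but columns 4 to 7 may or may not have data. For example, the csv file I am reading the data from looks like this: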

2017-05-09 21:52:42 , 1494391962 , p69465323_serv80i:10:450 , 7 , fb_406423006398063:396560, guest_861067032060185_android:671051, fb_100000829486587:186589, fb_100007900293502:407374, fb_172395756592775:649795

2017-05-09 21:52:42 , 1494391962 , z67265107_serv77i:4:45 , 2:Re , fb_106996523208498:110066, fb_274049626104849:86632, fb_111857069377742:69348, fb_127277511127344:46246

2017-05-09 21:52:42 , 1494391962 , v73392772_serv33i:9:1400 , 1:4x , c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone:314129, fb_217409795286934:294262

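As can be seen, the first entry has all 9 columns populated, the second has 8, and the third has only 6.

From the resulting RDD, I have to map column arr(1)(0) with columns arr(3)(0) through arr(7)(0). Column 1 should be mapped only with the populated columns among 3 to 7; empty columns between 3 and 7 need not be mapped with column 1. I tried to do this with a for loop:

After executing the statement val uid = read_file.map(), I have this: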

(String, String, String, String, String, String, String) = (" p69465323_serv80i"," 7 "," fb_406423006398063"," guest_861067032060185_android"," fb_100000829486587"," fb_100007900293502"," fb_172395756592775")

Then I do:

for (var x <= 5 to 7) { if var arr => (arr(x) != null) {
val pairedRdd = uid.map(x => ((x._1, x._3), (x._1, x._4), (x._1, x._5), (x._1, x._6), (x._1, x._7)) ) }

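This would work for the first line in the given data sample, but not for the second and third.

The logic is wrong, I admit, but it is only meant to convey the idea of what I am trying to do.

P.S.: Using Spark SQL is not allowed.

1 Answer:

Answer 0 (score: 1)

You can do the following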

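val read_file = sc.textFile("Path to input file")
// Split each line on commas, then pair every field that is present with the key from column 2
val uid = read_file.map(line => line.split(",")).map(array => array.map(arr => {
    // When a field contains ":", keep only the part before the first ":"
    if(arr.contains(":")) (array(2).split(":")(0), arr.split(":")(0))
    else (array(2).split(":")(0), arr)
}))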
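
The reason mapping over the whole array copes with short lines is that `split(",")` returns only the fields that are actually present, so no missing index is ever touched. A minimal sketch of that behaviour on plain Scala collections (no RDD needed); the object `FieldCounts`, the helper `fieldAt`, and the shortened sample lines are hypothetical stand-ins for the data in the question:

```scala
object FieldCounts {
  // Shortened stand-ins for the sample lines in the question (9, 8 and 6 fields).
  val lines: Seq[String] = Seq(
    "d , t , p69:10:450 , 7 , a:1, b:2, c:3, d:4, e:5",
    "d , t , z67:4:45 , 2:Re , a:1, b:2, c:3, d:4",
    "d , t , v73:9:1400 , 1:4x , a:1, b:2"
  )

  // Number of comma-separated fields actually present on a line.
  def fieldCount(line: String): Int = line.split(",").length

  // Safe access: Some(field) when the index exists, None otherwise,
  // instead of the exception thrown by arr(index) on a short line.
  def fieldAt(line: String, index: Int): Option[String] =
    line.split(",").toSeq.lift(index)

  def main(args: Array[String]): Unit =
    lines.foreach { line =>
      println(s"${fieldCount(line)} fields, column 8 = ${fieldAt(line, 8)}")
    }
}
```

Indexing `arr(8)` directly on the 6-field line would throw an `ArrayIndexOutOfBoundsException`, which is why a fixed-width tuple built from `arr(6)` to `arr(8)` cannot work for every line.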

Now, doing

uid.map(array => array.drop(2)).map(array => array.toSeq)

will give you the RDD

WrappedArray(( p69465323_serv80i, p69465323_serv80i), ( p69465323_serv80i, 7 ), ( p69465323_serv80i, fb_406423006398063), ( p69465323_serv80i, guest_861067032060185_android), ( p69465323_serv80i, fb_100000829486587), ( p69465323_serv80i, fb_100007900293502), ( p69465323_serv80i, fb_172395756592775))
WrappedArray(( z67265107_serv77i, z67265107_serv77i), ( z67265107_serv77i, 2), ( z67265107_serv77i, fb_106996523208498), ( z67265107_serv77i, fb_274049626104849), ( z67265107_serv77i, fb_111857069377742), ( z67265107_serv77i, fb_127277511127344))
WrappedArray(( v73392772_serv33i, v73392772_serv33i), ( v73392772_serv33i, 1), ( v73392772_serv33i, c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone), ( v73392772_serv33i, fb_217409795286934))

whereas

uid.map(array => array.drop(2)).flatMap(array => array)

will give you the RDD

( p69465323_serv80i, p69465323_serv80i)
( p69465323_serv80i, 7 )
( p69465323_serv80i, fb_406423006398063)
( p69465323_serv80i, guest_861067032060185_android)
( p69465323_serv80i, fb_100000829486587)
( p69465323_serv80i, fb_100007900293502)
( p69465323_serv80i, fb_172395756592775)
( z67265107_serv77i, z67265107_serv77i)
( z67265107_serv77i, 2)
( z67265107_serv77i, fb_106996523208498)
( z67265107_serv77i, fb_274049626104849)
( z67265107_serv77i, fb_111857069377742)
( z67265107_serv77i, fb_127277511127344)
( v73392772_serv33i, v73392772_serv33i)
( v73392772_serv33i, 1)
( v73392772_serv33i, c2eb11fd-99dc-4dee-a75c-bc9bfd2e0ae4iphone)
( v73392772_serv33i, fb_217409795286934)
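
The difference between the two results comes down to `map` versus `flatMap`. A small sketch on ordinary Scala collections, assuming the same semantics carry over to the RDD methods of the same name; the object `MapVsFlatMap` and the sample pairs are hypothetical:

```scala
object MapVsFlatMap {
  // Hypothetical, already-paired rows standing in for the contents of uid.
  val rows: Seq[Array[(String, String)]] = Seq(
    Array(("k1", "a"), ("k1", "b"), ("k1", "c")),
    Array(("k2", "d"))
  )

  // map keeps one element per input row: a collection of collections.
  val nested: Seq[List[(String, String)]] = rows.map(row => row.toList)

  // flatMap concatenates the rows: one element per pair.
  val flat: Seq[(String, String)] = rows.flatMap(row => row.toList)

  def main(args: Array[String]): Unit = {
    println(s"map: ${nested.length} rows")
    println(s"flatMap: ${flat.length} pairs")
  }
}
```

With `map` the pairs stay grouped by input line; with `flatMap` every pair becomes its own record, which is usually the more convenient shape for key-based operations.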

The choice is yours.
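
If fields can also be present but blank (for example `, ,` in the middle of a line), or if the self-pair and the pair with column 3 are unwanted, a variant along these lines might help. This is only a sketch under assumptions: the object `PairKeyWithFields`, the helpers `idOf` and `pairs`, and the default start column of 4 (matching the columns paired in the question's own attempt) are all hypothetical and not part of the answer above.

```scala
object PairKeyWithFields {
  // Text before the first ':' with surrounding whitespace removed.
  def idOf(field: String): String = field.takeWhile(_ != ':').trim

  // Pair the key from column 2 with every non-empty field from
  // firstValueColumn onwards. Assumption: value columns start at index 4.
  def pairs(line: String, firstValueColumn: Int = 4): Seq[(String, String)] = {
    // Limit -1 keeps trailing empty fields so they are filtered explicitly.
    val fields = line.split(",", -1).toSeq.map(idOf)
    fields.lift(2) match {
      case Some(key) if key.nonEmpty =>
        fields.drop(firstValueColumn).filter(_.nonEmpty).map(value => (key, value))
      case _ => Seq.empty // line too short to contain a key
    }
  }

  def main(args: Array[String]): Unit =
    pairs("d , t , v73:9:1400 , 1:4x , c2:31, , fb_2:29").foreach(println)
}
```

On an RDD this could presumably be applied as `read_file.flatMap(line => PairKeyWithFields.pairs(line))`, which yields one trimmed `(key, value)` record per populated field.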