Spark file scrubbing and transformation

Time: 2017-06-13 07:46:33

Tags: apache-spark

I have a tab-separated-value (TSV) file with 5 columns. I need to do data scrubbing and transformation. 
case 1) Replace the special characters (\001 and \x0D) with "" 
case 2) Filter rows that have fewer than 5 columns into a Bad_Row RDD
case 3) Iterate over the Bad_Row RDD; if the last character of a row is "\n", remove it and append the next row, repeating until the column count reaches 5 
Sample File Format
------------------------------
one two 12345   four    five
aaa ppp 12345   ttt 
bbb
ccc rrr 12355
yyy
ddd
eee iii 12845   rrr     two

Good_Rows RDD
-------------------------------
one two 12345   four    five
eee iii 12845   rrr     two

BAD_Row RDD
-------------------------------
aaa ppp 12345   ttt 
bbb
ccc rrr 12355
yyy
ddd

Remove the "\n" from the second row, append the third row to the second row, and recount the columns; if the column count is 5, treat it as a good row and move it into the Good_Rows RDD.
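For illustration only, here is a minimal plain-Python sketch of that merging logic (the merge_bad_rows helper is hypothetical and it assumes the bad rows are already available as a list of strings):

def merge_bad_rows(bad_rows, expected_cols=5):
    # Concatenate consecutive short rows until each merged row has 5 columns
    merged, buffer = [], []
    for line in bad_rows:
        # Strip any trailing newline before appending this row's columns
        buffer.extend(line.rstrip("\n").split("\t"))
        if len(buffer) >= expected_cols:
            merged.append("\t".join(buffer[:expected_cols]))
            buffer = buffer[expected_cols:]
    return merged

# e.g. merge_bad_rows(["aaa\tppp\t12345\tttt", "bbb"]) returns ["aaa\tppp\t12345\tttt\tbbb"]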

Sample code snippet

def FilterData(line):
    # Keep only the lines that do not split into exactly 5 tab-separated columns
    row = line.split("\t")
    col_count = len(row)
    return col_count != 5


import re

textFile1 = sc.textFile("hdfs://localhost:9000/A/test.tsv")
Clean_RDD = textFile1.map(lambda x: re.sub(r"[\001\x0D]", "", x))  # case 1: strip the special characters
Badrow_RDD = Clean_RDD.filter(FilterData)  # case 2: keep only rows without 5 columns

Please help implement case 3.

Thanks 
Vishal

1 answer:

Answer 0 (score: 0)

Below is the code for case 3 in Scala.

val data = sc.textFile("file:/home/rieter/Test_Streaming/ab.txt")
  .map(x => x.split(" +").map(_.trim))   // split each line into trimmed columns
  .filter(x => x.size < 5)               // keep only the short ("bad") rows
  .flatMap(x => x)                       // flatten their columns into a single sequence
  .collect
val data1 = data.take(data.length - (data.length % 5))  // drop trailing columns that cannot form a full 5-column row
data1.splitAt(5)

Output

   scala> data1.splitAt(5)
   res59: (Array[String], Array[String]) = (Array(aaa, ppp, 12345, ttt, bbb),Array(ccc, rrr, 12355, yyy, ddd))
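
For completeness, a hedged PySpark sketch of the same idea, reusing the Badrow_RDD from the question, might look like the following (the names bad_cols, fixed_rows and Fixed_RDD are hypothetical, and it assumes the bad rows fit in driver memory):

# Collect the bad rows on the driver and flatten their tab-separated columns,
# trimming whitespace and dropping empty columns
bad_cols = Badrow_RDD.flatMap(lambda line: [c.strip() for c in line.split("\t") if c.strip()]).collect()

# Drop trailing columns that cannot form a complete 5-column row
bad_cols = bad_cols[:len(bad_cols) - (len(bad_cols) % 5)]

# Regroup the columns into rows of 5 and join them back into tab-separated lines
fixed_rows = ["\t".join(bad_cols[i:i + 5]) for i in range(0, len(bad_cols), 5)]

# The repaired rows can then be parallelized and unioned with the good rows
Fixed_RDD = sc.parallelize(fixed_rows)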