I have a tab-separated values (TSV) file with 5 columns. I need to scrub and transform the data.
case 1) Replace the special characters (\001 and \x0D) with ""
case 2) Filter the rows that have fewer than 5 columns into a Bad_Row RDD
case 3) Iterate over the Bad_Row RDD: if the last character of a row is "\n", remove it and append the next row, repeating until the column count reaches 5
Sample file format
------------------------------
one two 12345 four five
aaa ppp 12345 ttt
bbb
ccc rrr 12355
yyy
ddd
eee iii 12845 rrr two
Good_Rows RDD
-------------------------------
one two 12345 four five
eee iii 12845 rrr two
Bad_Row RDD
-------------------------------
aaa ppp 12345 ttt
bbb
ccc rrr 12355
yyy
ddd
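Remove the "\n" from the second row, append the third row to it, and recount the columns. If the column count is now 5, treat the merged row as a good row and add it to the Good_Rows RDD.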
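import re

def FilterData(line):
    # A row is bad when it does not have exactly 5 tab-separated columns
    return len(line.split("\t")) != 5

textFile1 = sc.textFile("hdfs://localhost:9000/A/test.tsv")
# case 1: str.replace does not accept a regex, so use re.sub to strip \x01 and \r
Clean_RDD = textFile1.map(lambda x: re.sub(r"[\x01\r]", "", x))
# case 2: filter (not map) keeps only the rows for which FilterData is true
Badrow_RDD = Clean_RDD.filter(FilterData)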
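Cases 1 and 2 can also be checked on plain Python strings, without a SparkContext. The following is a minimal sketch under stated assumptions: `scrub` and `split_good_bad` are hypothetical helper names, the sample rows are assumed to be tab-separated as described, and the stray `\r` and `\x01` characters in the sample are added only to exercise the cleaning step.

```python
import re

EXPECTED_COLS = 5  # column count stated in the question

def scrub(line):
    # Case 1: drop \x01 (written \001 in the question) and \r (\x0D)
    return re.sub(r"[\x01\r]", "", line)

def split_good_bad(lines, expected_cols=EXPECTED_COLS):
    # Case 2: separate complete rows from short ones, preserving file order
    good, bad = [], []
    for line in lines:
        cleaned = scrub(line)
        target = good if len(cleaned.split("\t")) == expected_cols else bad
        target.append(cleaned)
    return good, bad

# Sample rows from the question; the \r and \x01 are illustrative additions
sample = [
    "one\ttwo\t12345\tfour\tfive",
    "aaa\tppp\t12345\tttt\r",
    "bbb",
    "ccc\trrr\t12355",
    "yyy",
    "ddd",
    "eee\tiii\t12845\trrr\ttwo\x01",
]
good_rows, bad_rows = split_good_bad(sample)
```

Here `good_rows` and `bad_rows` correspond to the Good_Rows and Bad_Row listings above. In Spark the same two functions could be passed to `map` and `filter`.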
Please help me implement case 3.
Thanks
Vishal
Answer 0 (score: 0)
Here is code for case 3 in Scala.
val data=sc.textFile("file:/home/rieter/Test_Streaming/ab.txt").map(x=>x.split(" +").map(x=>x.trim)).filter(x=>(x.size<5)).flatMap(x=>x).collect
val data1=data.take(data.length-(data.length%5))
data1.splitAt(5)
Output
scala> data1.splitAt(5)
res59: (Array[String], Array[String]) = (Array(aaa, ppp, 12345, ttt, bbb),Array(ccc, rrr, 12355, yyy, ddd))
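Note that splitAt(5) produces only two pieces, so it fits the sample but not a longer file. The same flatten-and-regroup idea can be sketched in Python for any number of rows. This is a sketch, not a definitive implementation: `merge_fragments` is a hypothetical helper, it splits on tabs as the question describes (the Scala code above splits on spaces), and it assumes the bad rows have already been collected to the driver in file order, as the Scala code does with collect.

```python
EXPECTED_COLS = 5  # column count stated in the question

def merge_fragments(bad_rows, expected_cols=EXPECTED_COLS, sep="\t"):
    # Returns (repaired, leftover). Each repaired row has exactly
    # expected_cols fields; leftover holds trailing fields that never
    # filled a complete row.
    repaired = []
    pending = []
    for row in bad_rows:
        # Strip a trailing newline defensively, then queue this row's fields
        pending.extend(row.rstrip("\n").split(sep))
        # Emit a row every time enough fields have accumulated
        while len(pending) >= expected_cols:
            repaired.append(pending[:expected_cols])
            pending = pending[expected_cols:]
    return repaired, pending

bad_rows = ["aaa\tppp\t12345\tttt", "bbb", "ccc\trrr\t12355", "yyy", "ddd"]
repaired, leftover = merge_fragments(bad_rows)
for fields in repaired:
    print("\t".join(fields))
```

Like the Scala version, this treats every fragment as a separate field and cuts a new row after every fifth field, so a fragment that overflows a row spills into the next one. If a line break can fall in the middle of a field, the fragments would need to be concatenated without a separator instead.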