I have input files (CSV) with up to 20 columns. I have to filter the input file based on the number of columns: if a row contains 20 columns it is considered good data, otherwise it is bad data.
Input file:
123456,"ID_SYS",12,"Status_code","feedback","HIGH","D",""," ",""," ","","
",9999," ",2013-05-02,9999-12-31,"N",1,2
I am reading the file into an RDD, splitting each line on `,`, and checking whether the line contains 20 columns:
val rdd = SparkConfig.spark.sparkContext.textFile(CommonUtils.loadConf.getString("conf.inputFile"))
val splitRDD = rdd.map(line => line.split(",", -1)) // limit -1 so trailing empty fields are kept
val goodRDD = splitRDD.filter(arr => arr.length == 20)
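The split-and-filter step above can also count the bad rows at the same time using a Spark accumulator. This is only a sketch under the question's own setup (`SparkConfig.spark` and the `rdd` read above are assumed); the accumulator name is illustrative:

```scala
// Hypothetical sketch: filter to 20-column rows while counting the rejects.
// Assumes `rdd` was read with sparkContext.textFile as in the question.
val badCount = SparkConfig.spark.sparkContext.longAccumulator("badRows")

val goodRDD = rdd
  .map(line => line.split(",", -1))   // -1 keeps trailing empty fields
  .filter { arr =>
    val ok = arr.length == 20
    if (!ok) badCount.add(1)          // runs only when an action evaluates the RDD
    ok
  }
```

Note that the accumulator value is only reliable after an action (e.g. `goodRDD.count()`) has forced evaluation, and may over-count if a stage is retried.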
I have to convert `goodRDD` to a DataFrame/Dataset in order to apply some transformations. I tried the following code:
val rowRdd = splitRDD.map {
  case Array(c1, c2, c3, ..., c20) => Row(c1.toInt, c2, ...)
  case _ => badCount += 1 // Scala has no ++ operator
}
val ds = SparkConfig.spark.sqlContext.createDataFrame(rowRdd, inputFileSchema)
Since I have 20 columns, do I really have to write all 20 out in the pattern match? I would like to know the most elegant way to solve this.
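One way to avoid enumerating all 20 columns is to build each `Row` with `Row.fromSeq` instead of a 20-arm pattern match. A minimal sketch, assuming the `inputFileSchema`, `SparkConfig`, and `rdd` from the question:

```scala
import org.apache.spark.sql.Row

// Keep only 20-column rows and convert each array straight to a Row,
// without naming c1..c20. Any per-column casts (e.g. toInt on the first
// column) would still have to be applied explicitly to match the schema.
val rowRdd = rdd
  .map(_.split(",", -1))      // -1 keeps trailing empty fields
  .filter(_.length == 20)
  .map(arr => Row.fromSeq(arr))

val df = SparkConfig.spark.sqlContext.createDataFrame(rowRdd, inputFileSchema)
```

Separately, note that the sample row contains a quoted field spanning two lines, which a plain `textFile` + `split(",")` cannot parse correctly; the built-in CSV reader (`spark.read.option("multiLine", true).option("mode", "DROPMALFORMED").schema(inputFileSchema).csv(path)`) may be a better fit for that input, since `DROPMALFORMED` discards rows that do not match the schema.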