Formatting a text file with Spark

Time: 2017-08-22 00:34:48

Tags: scala apache-spark

I am new to Spark, using Scala 2.10 with Spark 1.6. I am trying to format Input_file_001.txt, which is produced by the following Groovy code:

import groovy.sql.Sql

class EmployeeService {

    def dataSource

    def printTable() {
        def sql = new Sql(dataSource)
        def tableMap = [:]
        int count = 0
        sql.eachRow("SELECT * FROM employee") { row ->
            // Note: the map keys are overwritten on every row, so only the last row survives
            tableMap.'first_name' = row.first_name
            tableMap.'last_name' = row.last_name
            tableMap.'born' = row.born
            print "\nIteration No " + count
            count++
        }
        sql.close()
        for (e in tableMap) { print "key = ${e.key}, value = ${e.value}" }
    }
}

The output file, Input_file_001.txt, looks like this:

Dept 0100 Batch Load Errors for 8/16/2016 4:45:56 AM 

Case 1111111111
Rectype: ABCD 
Key:UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1
UMSV ERROR  :UNITS_ALLOW must be > or = UNITS_PAID 

Case 2222222222
Rectype: ABCD 
Key:UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2
UMSV ERROR  :UNITS_ALLOW must be > or = UNITS_PAID 
NTNB ERROR  :Invalid Value                          NTNB_MCTR_SUBJ=AMOD

Case 3333333333
Rectype: WXYZ 
Key:UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2
UMSV ERROR  :UNITS_ALLOW must be > or = UNITS_PAID 

I am trying to transform it into the format below:

case~Rectype~key,Error
1111111111~ABCD~UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1~UMSV ERROR  :UNITS_ALLOW must be > or = UNITS_PAID
2222222222~ABCD~UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2~UMSV ERROR  :UNITS_ALLOW must be > or = UNITS_PAID,NTNB ERROR  :Invalid Value                          NTNB_MCTR_SUBJ=AMOD
3333333333~WXYZ~UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2~UMSV ERROR  :UNITS_ALLOW must be > or = UNITS_PAID

The code above gives me an Array[String], and I am unable to proceed any further with it. Any help is appreciated.

1 Answer:

Answer 0 (score: 0)

You can use the wholeTextFiles API to read the input file; it reads each file as a (filename, whole text as one string) pair. You can then manipulate that whole text and transform it into the desired output. Finally, you add a header and save the result to a file.
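
For contrast, a plain sc.textFile read returns one RDD element per physical line, which is why the multi-line Case blocks are hard to reassemble, while wholeTextFiles keeps each file intact. A minimal sketch (the path is a placeholder):

// Line-based read: one element per line, so each Case block is split apart
val lines = sc.textFile("path to Input_file_001.txt")
lines.take(3)  // e.g. Array("Dept 0100 Batch Load Errors ...", "", "Case 1111111111")

// File-based read: one (filename, full content) pair per file
val files = sc.wholeTextFiles("path to Input_file_001.txt")
files.first()._2  // the whole file as a single String

With the whole file available as one string, the transformation becomes a chain of split and replace calls: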

import org.apache.spark.rdd.RDD

// Read the whole file as a single (filename, content) pair
val rdd = sc.wholeTextFiles("path to Input_file_001.txt")

// Split the text into one record per Case, then collapse each
// multi-line record into a single ~-delimited line
val finalRdd = rdd.flatMap(tuple => tuple._2.split("\nCase ")
  .map(record => record.replace("\nRectype: ", "~").trim
    .replace("\nKey:", "~").trim
    .replace("\nUMSV ERROR  :", "~UMSV ERROR  :").trim
    .replace("\nNTNB ERROR  :", ",NTNB ERROR  :").trim)
).filter(record => !record.startsWith("Dept"))  // drop the Dept header block

// Prepend the header line and write the result out
val header: RDD[String] = sc.parallelize(Array("case~Rectype~key,Error"))
header.union(finalRdd).saveAsTextFile("path to output file")
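
saveAsTextFile writes one part file per partition, and since union simply concatenates the two RDDs' partitions, the single header partition is written first. If you want everything in one file with the header guaranteed on top, coalesce to a single partition before saving (a sketch, same placeholder path):

header.union(finalRdd).coalesce(1).saveAsTextFile("path to output file")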

You should get the following output:

case~Rectype~key,Error
1111111111~ABCD ~UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1~UMSV ERROR  :UNITS_ALLOW must be > or = UNITS_PAID
2222222222~ABCD ~UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2~UMSV ERROR  :UNITS_ALLOW must be > or = UNITS_PAID ,NTNB ERROR  :Invalid Value                          NTNB_MCTR_SUBJ=AMOD
3333333333~WXYZ ~UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2~UMSV ERROR  :UNITS_ALLOW must be > or = UNITS_PAID
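
The stray spaces before some of the ~ separators (e.g. "ABCD ~") come from the trailing spaces at the end of the Rectype and error lines in the raw file. If the result has to match the desired format exactly, a regex-based replaceAll can absorb that whitespace. A sketch of an adjusted mapping (a hypothetical variant of the code above, not part of the original answer):

val exactRdd = rdd.flatMap(tuple => tuple._2.split("\nCase ")
  .map(record => record
    .replaceAll("\\s*\nRectype: ", "~")  // \s* swallows the trailing space before the newline
    .replaceAll("\\s*\nKey:", "~")
    .replaceAll("\\s*\nUMSV ERROR  :", "~UMSV ERROR  :")
    .replaceAll("\\s*\nNTNB ERROR  :", ",NTNB ERROR  :")
    .trim)
).filter(record => !record.startsWith("Dept"))

This should reproduce the desired output with no space before the ~ separators or the comma.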

I hope the answer is helpful.