I'm new to Spark and am using Scala 2.10 with Spark 1.6. I'm trying to reformat Input_file_001.txt, shown below.
Input_file_001.txt:
Dept 0100 Batch Load Errors for 8/16/2016 4:45:56 AM
Case 1111111111
Rectype: ABCD
Key:UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1
UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
Case 2222222222
Rectype: ABCD
Key:UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2
UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
NTNB ERROR :Invalid Value NTNB_MCTR_SUBJ=AMOD
Case 3333333333
Rectype: WXYZ
Key:UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2
UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
I'm trying to turn it into output like this:
case~Rectype~key,Error
1111111111~ABCD~UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
2222222222~ABCD~UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID,NTNB ERROR :Invalid Value NTNB_MCTR_SUBJ=AMOD
3333333333~WXYZ~UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
The code I have gives me an Array[String], and I couldn't take it any further. Any help is appreciated.
Answer 0 (score: 0)
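You can use the wholeTextFiles API to read the input file. It returns one record per file, in the form (filename, whole text of the file as a single string). You can then manipulate that text and convert it into the required output. Finally, you can add a header and save the result to a file.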
import org.apache.spark.rdd.RDD

// wholeTextFiles yields one (path, content) pair per file
val rdd = sc.wholeTextFiles("path to Input_file_001.txt")
val finalRdd = rdd.flatMap { case (_, content) =>
  // each "Case" block becomes one record; the first chunk is the "Dept ..." banner
  content.split("\nCase ").map(record =>
    record.trim
      .replace("\nRectype: ", "~")
      .replace("\nKey:", "~")
      .replace("\nUMSV ERROR :", "~UMSV ERROR :")
      .replace("\nNTNB ERROR :", ",NTNB ERROR :"))
}.filter(record => !record.startsWith("Dept"))
// put the header line in front of the records and write the result
val header: RDD[String] = sc.parallelize(Array("case~Rectype~key,Error"))
header.union(finalRdd).saveAsTextFile("path to output file")
You should get the following output:
case~Rectype~key,Error
1111111111~ABCD ~UMUM_REF_ID=A12345678,UMSV_SEQ_NO=1~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
2222222222~ABCD ~UMUM_REF_ID=B87654321,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID ,NTNB ERROR :Invalid Value NTNB_MCTR_SUBJ=AMOD
3333333333~WXYZ ~UMUM_REF_ID=U19817250,UMSV_SEQ_NO=2~UMSV ERROR :UNITS_ALLOW must be > or = UNITS_PAID
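Note that the output above still has a space before some separators (for example `ABCD ~`), which suggests the input lines carry trailing whitespace that the string replacements do not remove. One possible way around this is to process each Case block line by line and trim every line. The following is only a minimal sketch under that assumption: `CaseFormatter` and `formatCases` are hypothetical names introduced here (they are not part of Spark), and it assumes that every line in a block other than the case number, `Rectype:` and `Key:` is an error message.

```scala
// Hypothetical helper (not part of Spark): converts the whole text of one
// input file into one "~"-separated record per Case block.
object CaseFormatter {
  def formatCases(content: String): Seq[String] = {
    // Split wherever a line starts with "Case "; the first chunk is whatever
    // precedes the first Case (here the "Dept ..." banner), so drop it.
    val blocks = content.split("(?m)^Case ").toSeq.drop(1)
    blocks.flatMap { block =>
      // Trim every line so trailing spaces or "\r" do not leak into the output.
      block.split("\n").map(_.trim).filter(_.nonEmpty).toList match {
        case caseId :: rest =>
          val recType = rest.find(_.startsWith("Rectype:"))
            .map(_.stripPrefix("Rectype:").trim).getOrElse("")
          val key = rest.find(_.startsWith("Key:"))
            .map(_.stripPrefix("Key:").trim).getOrElse("")
          // Assumption: all remaining lines are error messages.
          val errors = rest.filterNot(l => l.startsWith("Rectype:") || l.startsWith("Key:"))
          Some(Seq(caseId, recType, key, errors.mkString(",")).mkString("~"))
        case Nil => None // an empty block produces no record
      }
    }
  }
}
```

Because the helper is a plain function on a String, it can be checked without a cluster and then plugged into the same pipeline, for example `rdd.flatMap { case (_, content) => CaseFormatter.formatCases(content) }` in place of the chained `replace` calls.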
I hope the answer is helpful.