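I am trying to rewrite a SQL query in Scala. Message is the 4th pipe-delimited column of the file, and msg is the 3rd comma-separated field within Message, i.e. the part that starts with MESSAGE >>>. Sample file data: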
[06-26 00:01:52,036] | Container : 5 | INFO | relation ID: 00002ZaaaaaaXdsZb:-1:55609051-1879-4be8-b1c9-1d2006b17135, Message: acadeontroller.java recordLogRequest - 50 (...) , MESSAGE >>> API - XX_XX_XX {CHECKSUM=9ABF5975467E394F54442FBD4F6473D3,MEMBER_TYPE=}
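To make the column positions concrete, here is a minimal sketch that applies the two splits to the sample line in plain Scala (no Spark needed; it can be pasted into the Scala REPL or spark-shell). The names messageColumn and msg are only illustrative. Note that splitting on ',' also cuts the record at the comma inside the braces, so the trailing MEMBER_TYPE=} ends up in a separate piece.

```scala
// The sample line from above, with the elided "(...)" left exactly as given.
val line = "[06-26 00:01:52,036] | Container : 5 | INFO | relation ID: 00002ZaaaaaaXdsZb:-1:55609051-1879-4be8-b1c9-1d2006b17135, Message: acadeontroller.java recordLogRequest - 50 (...) , MESSAGE >>> API - XX_XX_XX {CHECKSUM=9ABF5975467E394F54442FBD4F6473D3,MEMBER_TYPE=}"

// 4th pipe-delimited column (index 3): the Message column.
val messageColumn = line.split('|')(3)

// 3rd comma-delimited piece of that column (index 2): the msg part.
val msg = messageColumn.split(',')(2)

// Show the extracted piece without its leading space.
println(msg.trim)
```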
The query is as follows:
INSERT OVERWRITE TABLE staging.cleaned_data_7 SELECT * FROM staging.cleaned_data_6 WHERE msg NOT LIKE '%KEEP_ALIVE%' AND msg NOT LIKE '%XXX_CHANNEL_SERVICE%' AND msg NOT LIKE '%XXX Finished%' AND msg NOT LIKE '%API -%' ;
I tried two approaches. The first uses map and filter, but it cannot return the whole record that matches the condition; I can only extract the Message field. Since the query is a SELECT *, that is not usable.
val sample = sc.textFile("file:////home/user/sample.txt").map(x=>x.split('|')(3)).map(x=>x.split(',')(2))
val myFilter = sample.filter(x =>
!(x contains "KEEP_ALIVE") &&
!(x contains "XXX_CHANNEL_SERVICE") &&
!(x contains "XXX Finished") &&
!(x contains "API -") )
Approach 2: I am using the partition function, but I get an error.
val (valid,invalid) = readFile.partition{ line=>
val Message = line.split('|')(3).split(',')(2).toString
Message.filter(x =>
!(x contains "KEEP_ALIVE") &&
!(x contains "XXX_CHANNEL_SERVICE") &&
!(x contains "XXX Finished") &&
!(x contains "API -")
)
}
<console>:48: error: value contains is not a member of Char
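Answer 0 (score: 2)
Try doing the split inside the filter, so that the filter sees the whole line and keeps the whole record. Here sample has to be the RDD of raw lines, i.e. the result of sc.textFile without the two map calls: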
val skippedMessages = List("KEEP_ALIVE", "XXX_CHANNEL_SERVICE", "XXX Finished", "API -")
val result = sample.filter { line =>
val message = line.split('|')(3).split(',')(2)
!skippedMessages.exists(message.contains)
}
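The same idea can be tried without a cluster. The sketch below uses a plain List in place of the RDD, since filter has the same shape on both. The helper extractMsg and the three input lines are made up for illustration and are not part of the original question. The helper returns None when a line has too few fields, instead of throwing an ArrayIndexOutOfBoundsException as a bare (3) or (2) index would.

```scala
val skippedMessages = List("KEEP_ALIVE", "XXX_CHANNEL_SERVICE", "XXX Finished", "API -")

// Hypothetical helper: the msg part of a line, or None if fields are missing.
def extractMsg(line: String): Option[String] =
  line.split('|').lift(3).flatMap(column => column.split(',').lift(2))

// Made-up lines for illustration, shaped like the sample record.
val lines = List(
  "[t1] | Container : 5 | INFO | relation ID: a, Message: m , MESSAGE >>> API - XX",
  "[t2] | Container : 5 | INFO | relation ID: b, Message: m , MESSAGE >>> LOGIN OK",
  "a malformed line without enough fields"
)

// Keep the whole line when its msg part matches none of the skipped patterns.
// Lines with no msg part are dropped (Option.exists is false on None).
val kept = lines.filter { line =>
  extractMsg(line).exists(msg => !skippedMessages.exists(p => msg.contains(p)))
}

kept.foreach(println)
```

Dropping unparseable lines is a choice made here; it roughly mirrors SQL, where NOT LIKE on a NULL msg is not true and the row is filtered out. Replacing the outer exists with forall would keep such lines instead.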
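Answer 1 (score: 1)
After this statement: val message = line.split('|')(3).split(',')(2).toString, the variable message is a String.
When you filter() a String, you are iterating over its individual Char elements and choosing which Chars to keep and which to drop. That is why the compiler reports that contains is not a member of Char.
Also, partition() needs a Boolean result, which filter() does not provide.
Try this and see whether it gets you closer.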
val (valid,invalid) = readFile.partition{ line=>
val message = line.split('|')(3).split(',')(2).toString
!(message contains "KEEP_ALIVE") &&
!(message contains "XXX_CHANNEL_SERVICE") &&
!(message contains "XXX Finished") &&
!(message contains "API -")
}
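The difference between the two calls can be seen in a small plain-Scala sketch. The values and the isWanted helper are illustrative only, and a List stands in for readFile, whose type the question does not show.

```scala
val message = "MESSAGE >>> API - XX"

// filter on a String walks its Chars and returns a String, not a Boolean.
val lettersOnly: String = message.filter(c => c.isLetter)

// A Boolean predicate over the whole String is what partition needs.
def isWanted(msg: String): Boolean = !(msg contains "API -")

// partition on a List: the first result holds the elements for which the
// predicate is true, the second holds the rest.
val (valid, invalid) = List("MESSAGE >>> LOGIN OK", message).partition(isWanted)

println(lettersOnly)
println(valid)
println(invalid)
```

This assumes readFile is an ordinary Scala collection. As far as I know, a Spark RDD has no partition method that takes a predicate, so on an RDD the usual equivalent would be two filter calls, one with the predicate and one with its negation.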