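I have a log file containing information that I want to process with Spark. The only problem is that the file itself is not properly formatted, so I am trying to format it neatly and grab only the data I need.
I have noticed that most of the useful information carries an "INFO" tag, so I decided to filter with the following: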
val testje = realdata.filter(line => line.contains("INFO"))
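Now I want to load the resulting data into a SQLContext so that I can visualize it (in Zeppelin).
Here is a (very small) sample of what the data currently looks like: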
2016-03-08 14:55:29,637 INFO [ajp-nio-8009-exec-1] n.t.f.s.FloorService [FloorService.java:281] Snoozing. Wait 569 more milliseconds. Time passed : 4431
2016-03-08 14:55:29,964 INFO [ajp-nio-8009-exec-3] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor test received update from tile: 1, data = [false, false, false, false, false, false, false, false]
2016-03-08 14:55:30,582 INFO [ajp-nio-8009-exec-2] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor test received update from tile: 1, data = [false, false, false, false, false, false, true, false]
2016-03-08 14:55:30,592 INFO [ajp-nio-8009-exec-2] n.t.f.s.FloorService [FloorService.java:284] delta time : 5387
2016-03-08 14:55:30,595 INFO [ajp-nio-8009-exec-2] n.t.f.s.ActivityService [ActivityService.java:31] Activity added for floor with id: test
2016-03-08 14:55:30,854 INFO [ajp-nio-8009-exec-4] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor test received update from tile: 1, data = [false, false, false, false, false, false, false, false]
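What I really need is the date, the time, the tile ID, and the boolean values.
Is there a way to format this properly without having to account for all the junk data?
This is what I am trying at the moment (disclaimer: I am fairly new to this and a bit intimidated by it ^^'):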
import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset
val realdata = sc.textFile("/media/application.txt")
case class testClass(date: String, time: String, level: String, unknown1: String, unknownConsumer: String, unknownConsumer2: String, vloer: String, tegel: String, msg: String, bool1: String, bool2: String, bool3: String, bool4: String, bool5: String, bool6: String, bool7: String, bool8: String, batchsize: String, troepje1: String, troepje2: String)
//val testje = realdata.filter(line => line.contains("INFO"))
val mapData = realdata.map(s => s.split(" ")).filter(line => line.contains("INFO")).map(
  s => testClass(s(0),
    s(1),
    s(2),
    s(3),
    s(4),
    s(5),
    s(6),
    s(7),
    s(8),
    s(9),
    s(10),
    s(11),
    s(12),
    s(13),
    s(14),
    s(15),
    s(16),
    s(17),
    s(18),
    s(19)
  )
).toDF()
mapData.registerTempTable("test")
Answer 0 (score: 1)
I would try something like this:
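// Imports this snippet appears to rely on. Depending on the jackson-module-scala version,
// ScalaObjectMapper may live in com.fasterxml.jackson.module.scala instead.
import java.time.Instant
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

// Captures date, time, millis, tile ID and the bracketed boolean list of a tile-update line.
val regex = """^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}),(\d{0,3}) INFO .+Floor test received update from tile: (\d+), data = (\[((false|true)(, ){0,1})+\])$""".r
final case class LogLine(date: Instant, tileId: String, data: Seq[Boolean])
realdata.flatMap({
  case regex(date, time, millis, tileId, data, _*) =>
    // The bracketed list is valid JSON, so Jackson can read it as Seq[Boolean].
    val mapper = new ObjectMapper() with ScalaObjectMapper
    mapper.registerModule(DefaultScalaModule)
    Seq(LogLine(
      Instant.parse(s"${date}T$time.${millis}Z"),
      tileId,
      mapper.readValue[Seq[Boolean]](data)
    ))
  // Lines that do not match (all the junk) are dropped.
  case _ => Nil
})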
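The case class will be nested (the booleans sit in a Seq), but in this case that is probably what you want. If you really need to, you can flatten it afterwards.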
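If a flat table turns out to be easier to chart in Zeppelin, the nested LogLine could be flattened along these lines. This is only a minimal sketch: TileValue and its field names are hypothetical, and the timestamp is kept as a String on the assumption that the Spark version in use has no encoder for java.time.Instant.

```scala
import java.time.Instant

object FlattenSketch {
  // Nested shape, as defined in the answer.
  final case class LogLine(date: Instant, tileId: String, data: Seq[Boolean])

  // Hypothetical flat shape: one row per boolean, tagged with its position in the list.
  final case class TileValue(timestamp: String, tileId: String, position: Int, value: Boolean)

  // Turns one nested log line into one flat row per boolean.
  def flatten(line: LogLine): Seq[TileValue] =
    line.data.zipWithIndex.map { case (value, position) =>
      TileValue(line.date.toString, line.tileId, position, value)
    }
}
```

On an RDD of LogLine this would presumably be applied with flatMap(FlattenSketch.flatten) before calling toDF().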
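If you want better performance, you can use mapPartitions instead of flatMap and reuse the ObjectMapper, so that it is built once per partition rather than once per line.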
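The per-partition idea can be sketched without Spark or Jackson on the classpath. In the sketch below, BoolListParser is a hypothetical stand-in for a costly-to-build parser such as ObjectMapper, its counter exists only to show how often it is constructed, and the regex is adapted from the answer (inner groups made non-capturing, milliseconds fixed at three digits). None of these names come from the original code.

```scala
import java.time.Instant

object PartitionSketch {
  final case class LogLine(date: Instant, tileId: String, data: Seq[Boolean])

  // Adapted from the answer's pattern; captures date, time, millis, tile ID, boolean list.
  private val Line =
    """^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}),(\d{3}) INFO .+Floor test received update from tile: (\d+), data = \[((?:false|true)(?:, (?:false|true))*)\]$""".r

  // Hypothetical stand-in for an expensive parser such as Jackson's ObjectMapper.
  final class BoolListParser {
    BoolListParser.created += 1 // demo-only counter
    def parse(csv: String): Seq[Boolean] = csv.split(", ").map(_.toBoolean).toSeq
  }
  object BoolListParser { var created = 0 }

  // Iterator => Iterator is the function shape that RDD.mapPartitions expects.
  def parsePartition(lines: Iterator[String]): Iterator[LogLine] = {
    val parser = new BoolListParser // built once per partition, not once per line
    lines.flatMap {
      case Line(date, time, millis, tileId, values) =>
        Iterator(LogLine(Instant.parse(s"${date}T$time.${millis}Z"), tileId, parser.parse(values)))
      case _ => Iterator.empty // non-matching lines are dropped
    }
  }
}
```

With the RDD from the question, this would presumably be wired in as realdata.mapPartitions(PartitionSketch.parsePartition); a real ObjectMapper would take the place of BoolListParser at the marked line.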
Answer 1 (score: 1)
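I suggest filtering on "data" rather than "INFO", because the lines you want to split and convert into a DataFrame are the ones that contain "data".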
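The difference between the two filters can be seen on a few of the sample lines from the question, using plain Scala collections instead of an RDD. The stricter marker "data = [" is an assumption on my part, chosen so that an unrelated message that merely mentions the word would not slip through.

```scala
object FilterSketch {
  // Three lines copied from the sample in the question.
  val sample: Seq[String] = Seq(
    "2016-03-08 14:55:29,637 INFO [ajp-nio-8009-exec-1] n.t.f.s.FloorService [FloorService.java:281] Snoozing. Wait 569 more milliseconds. Time passed : 4431",
    "2016-03-08 14:55:29,964 INFO [ajp-nio-8009-exec-3] n.t.f.c.FloorUpdateController [FloorUpdateController.java:67] Floor test received update from tile: 1, data = [false, false, false, false, false, false, false, false]",
    "2016-03-08 14:55:30,592 INFO [ajp-nio-8009-exec-2] n.t.f.s.FloorService [FloorService.java:284] delta time : 5387"
  )

  // Every sample line is INFO, so this filter keeps the junk as well.
  val byInfo: Seq[String] = sample.filter(_.contains("INFO"))

  // Only the tile-update line carries the boolean list.
  val byData: Seq[String] = sample.filter(_.contains("data = ["))
}
```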
I have modified your code to fit your case class; you can edit it further as needed:
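val mapData = realdata
  .filter(line => line.contains("data"))
  .map(s => s.split(" ").toList)
  .map(
    s => testClass(s(0),    // date
      s(1).split(",")(0),   // time
      s(1).split(",")(1),   // milliseconds (lands in the "level" field)
      s(3),                 // thread name
      s(4),                 // logger
      s(5),                 // source file and line
      s(6),                 // "Floor"
      s(7),                 // floor name ("test")
      s(8),                 // "received"
      s(15),                // first boolean, still carrying "[" and ","
      s(16),
      s(17),
      s(18),
      s(19),
      s(20),
      s(21),
      s(22),                // last boolean, still carrying "]"
      "",                   // remaining fields are left empty
      "",
      ""
    )
  )
  .toDF()
mapData.show(false)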
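To see which index holds which value, it can help to split one sample line outside Spark. The sketch below is an illustration only: clean is a hypothetical helper, and the indices hold only for tile-update lines shaped like the sample in the question. It also shows that the tile ID sits at index 12 and that the boolean tokens keep their brackets and commas unless they are stripped.

```scala
object TokenSketch {
  // One tile-update line from the question's sample.
  val sample: String =
    "2016-03-08 14:55:30,582 INFO [ajp-nio-8009-exec-2] n.t.f.c.FloorUpdateController " +
    "[FloorUpdateController.java:67] Floor test received update from tile: 1, " +
    "data = [false, false, false, false, false, false, true, false]"

  val tokens: Array[String] = sample.split(" ")

  // Hypothetical helper: strips the brackets and commas that survive the split.
  def clean(token: String): String = token.replaceAll("[\\[\\],]", "")

  // Index 12 is "1," in the raw tokens.
  val tileId: String = clean(tokens(12))

  // Indices 15 to 22 hold the eight booleans, e.g. tokens(15) is "[false,".
  val bools: Seq[Boolean] = tokens.slice(15, 23).map(t => clean(t).toBoolean).toSeq
}
```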
Hope that helps.