How to read multi-line records, and how to handle records broken across an input split

Asked: 2013-07-18 02:23:03

Tags: hadoop mapreduce input-split

I have a log file that looks like this:

Begin ... 12-07-2008 02:00:05         ----> record1
incidentID: inc001
description: blah blah blah 
owner: abc 
status: resolved 
end .... 13-07-2008 02:00:05 
Begin ... 12-07-2008 03:00:05         ----> record2 
incidentID: inc002 
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc 
status: resolved 
end .... 13-07-2008 03:00:05

I want to process it with MapReduce, extracting the incident ID, the status, and the time each incident took.

How do I handle these records, given that they have variable lengths, and what happens if an input split falls before the end of a record?
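Setting the split problem aside for a moment, the per-record extraction itself is straightforward once a complete `Begin ... end` block is in hand. A minimal sketch, assuming the timestamps use the `dd-MM-yyyy HH:mm:ss` layout shown in the sample (the class name and output format are illustrative):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IncidentParser {
    // Timestamp layout assumed from the sample log above.
    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("dd-MM-yyyy HH:mm:ss");
    private static final Pattern BEGIN =
        Pattern.compile("Begin \\.\\.\\. (\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})");
    private static final Pattern END =
        Pattern.compile("end \\.\\.\\.\\. (\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2})");
    private static final Pattern ID = Pattern.compile("incidentID: (\\S+)");
    private static final Pattern STATUS = Pattern.compile("status: (\\S+)");

    private static String find(Pattern p, String record) {
        Matcher m = p.matcher(record);
        if (!m.find()) throw new IllegalArgumentException("missing field: " + p);
        return m.group(1);
    }

    /** Turns one complete Begin..end record into "incidentID,status,durationSeconds". */
    public static String parse(String record) {
        LocalDateTime begin = LocalDateTime.parse(find(BEGIN, record), TS);
        LocalDateTime end = LocalDateTime.parse(find(END, record), TS);
        long secs = Duration.between(begin, end).getSeconds();
        return find(ID, record) + "," + find(STATUS, record) + "," + secs;
    }
}
```

In a MapReduce job this parsing would live in the mapper; the hard part, as the answers below discuss, is getting a whole record delivered to it as one value.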

2 Answers:

Answer 0 (score: 5)

You need to write your own input format and record reader to make sure the file is split correctly around your record delimiters.

Basically, your record reader needs to seek to its split's byte offset, then scan forward (reading lines) until it finds:

  • a Begin ... line
    • it then reads through to the next end ... line, and supplies the lines between Begin and end as the input for the next record
  • it continues scanning like this past the end of its split, or until it hits EOF
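The scanning rule above can be sketched in plain Java, leaving the Hadoop `RecordReader` plumbing aside (the class name and the use of string offsets instead of a real stream are illustrative). The key invariant: a record belongs to the split in which its `Begin` line starts, so a reader skips any partial record it lands in and reads past its split's end to finish a record it started — each record is read exactly once across all splits:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitScanner {
    /**
     * Returns the records whose "Begin" line starts inside [splitStart, splitEnd).
     * A partial first line (when splitStart falls mid-line) simply fails the
     * Begin test and is skipped, mirroring how Hadoop's line reader discards it.
     */
    public static List<String> readRecords(String data, int splitStart, int splitEnd) {
        List<String> records = new ArrayList<>();
        StringBuilder record = null;   // non-null while inside a Begin..end block
        int pos = splitStart;
        while (pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl == -1) ? data.length() : nl + 1;
            String line = data.substring(pos, lineEnd);
            if (record == null) {
                if (pos >= splitEnd) break;            // no new record starts past our split
                if (line.startsWith("Begin")) {        // record start inside our split
                    record = new StringBuilder(line);
                }                                      // else: tail of a previous record, skip
            } else {
                record.append(line);                   // may run past splitEnd: finish the record
                if (line.startsWith("end")) {
                    records.add(record.toString());
                    record = null;
                }
            }
            pos = lineEnd;
        }
        return records;
    }
}
```

In a real `RecordReader`, `initialize()` would seek the input stream to the split's start and `nextKeyValue()` would perform one iteration of this loop, emitting the assembled block as the value.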

This is algorithmically similar to how Mahout's XmlInputFormat handles multi-line XML as input; in fact, you should be able to modify that source directly to handle your situation.

As noted in @irW's answer, NLineInputFormat is another option if your records have a fixed number of lines each, but it is very inefficient for larger files, because it has to open and read the entire file to discover the line offsets in the input format's getSplits() method.

Answer 1 (score: 1)

In your example, every record has the same number of lines. If that holds, you can use NLineInputFormat; if the number of lines cannot be known in advance, it may be more difficult. (More on NLineInputFormat: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html)
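For the six-line records in the sample, the driver setup might look like this (a configuration sketch, not a full job: `conf` and the mapper/reducer wiring are assumed to exist elsewhere). Note that NLineInputFormat still hands the mapper one line per `map()` call; with six lines per split the mapper must accumulate lines and assemble the record itself, which works only because split boundaries then align with record boundaries.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// In the job driver: make every split exactly one 6-line record,
// so no record is ever torn across two mappers.
Job job = Job.getInstance(conf, "incident-parser");
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 6);
```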