I have a log file; a representative snapshot is given below:
<Dec 12, 2013 2:46:24 AM CST> <Error> <java.rmi.RemoteException>
<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>
<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data
garbage data
garbade data
Io exception
>
<jan 01, 2014 2:46:24 AM CST> <Error> <garbage data
garbage data java.rmi.RemoteException
>
I am trying to build analytics on top of it.
What I want to do:
I want the count of each exception per year.
For example, from the sample data above my output should be:
java.rmi.RemoteException 2013 1
Io exception 2013 2
java.rmi.RemoteException 2014 1
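To make the expected output concrete, here is a minimal sketch of the counting step in plain Java, assuming each record has already been joined into a single string. The class name, the `KNOWN_EXCEPTIONS` list, and the timestamp regex are illustrative assumptions, not part of Hadoop; in a real job this logic would sit in a mapper and reducer.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExceptionYearCounter {
    // Hypothetical list of exception names to search for (an assumption).
    static final List<String> KNOWN_EXCEPTIONS =
            List.of("java.rmi.RemoteException", "Io exception");

    // Captures the year from a leading timestamp like "<Dec 12, 2013 2:46:24 AM CST>".
    static final Pattern YEAR = Pattern.compile("^<\\w{3} \\d{1,2}, (\\d{4}) ");

    // Counts "exception year" pairs; each input string is one complete record.
    static Map<String, Integer> count(List<String> records) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String record : records) {
            Matcher m = YEAR.matcher(record);
            if (!m.find()) {
                continue; // skip records without a recognizable timestamp
            }
            String year = m.group(1);
            for (String name : KNOWN_EXCEPTIONS) {
                if (record.contains(name)) {
                    counts.merge(name + " " + year, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> records = List.of(
                "<Dec 12, 2013 2:46:24 AM CST> <Error> <java.rmi.RemoteException>",
                "<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>",
                "<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data garbage data garbade data Io exception>",
                "<jan 01, 2014 2:46:24 AM CST> <Error> <garbage data garbage data java.rmi.RemoteException>");
        count(records).forEach((key, n) -> System.out.println(key + " " + n));
    }
}
```

Run against the four sample records, this prints the three lines listed above. It only works once multi-line records have been merged, which is the actual difficulty.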
What my problem is:
1. Hadoop processes a text file line by line, so it treats "Io exception" as part of
line 6, whereas it should be part of the record that starts on line 3 (and continues through line 7).
2. I can't use NLineInputFormat because there is no fixed number of lines per record.
What the pattern is, and what I want:
The only pattern I see is that a record starts with a "<" and ends with a ">". In the
above example, line 3 doesn't end with ">", so I want Hadoop to treat all the following
data as part of the same line until it reaches a ">".
The sample data as I want Hadoop to see it:
<Dec 12, 2013 2:46:24 AM CST> <Error> <java.rmi.RemoteException>
<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>
<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data garbage data garbade data Io exception>
<jan 01, 2014 2:46:24 AM CST> <Error> <garbage data garbage data java.rmi.RemoteException>
I would be glad if someone could share a piece of code or an idea to overcome this problem.
Thanks in advance :)
Answer 0 (score: 0)
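You need to implement an InputFormat and a RecordReader. What you really need is an adaptation of StreamInputFormat, which is part of the hadoop-streaming project.
For our multi-line XML use case, we use hadoop-streaming to read from a start tag to an end tag that we define. You can look at the source and adapt it to your requirements.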
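As a rough illustration of the core loop such a RecordReader would need, here is a minimal sketch in plain Java with no Hadoop dependencies. It assumes, as the question does, that a record ends at the first physical line whose last character is ">". The class and method names are hypothetical; in a real job this logic would go into the `nextKeyValue()` method of a custom `RecordReader`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class MultiLineRecordSketch {
    // Accumulates physical lines until one ends with ">"; returns null at end of input.
    static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            String piece = line.trim();
            if (piece.isEmpty()) {
                continue; // ignore blank lines
            }
            // Join pieces with one space, but attach a lone ">" directly.
            if (record.length() > 0 && !piece.equals(">")) {
                record.append(' ');
            }
            record.append(piece);
            if (piece.endsWith(">")) {
                return record.toString();
            }
        }
        // An unterminated trailing record is returned as-is.
        return record.length() > 0 ? record.toString() : null;
    }

    // Convenience wrapper: splits a whole text into logical records.
    static List<String> readAll(String text) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String record;
            while ((record = nextRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String log = String.join("\n",
                "<Dec 13, 2013 2:46:24 AM CST> <Error> <Io exception>",
                "<Dec 14, 2013 2:46:24 AM CST> <Error> <garbage data",
                "garbage data",
                "garbade data",
                "Io exception",
                ">");
        readAll(log).forEach(System.out::println);
    }
}
```

Two caveats apply to this sketch. A continuation line that happens to end with ">" would close the record early, so the rule is only as reliable as the pattern in the question. Input splits are also not handled here: a record can straddle a split boundary, and the simplest workaround is probably to override `isSplitable` in your `FileInputFormat` subclass to return false, at the cost of one mapper per file.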