Merging two lines of a log in Pig

Time: 2015-06-09 02:00:49

Tags: regex merge hive weblogic apache-pig

I am converting WebLogic logs to CSV format so that Hive can process them in later jobs. However, I have run into a problem with log entries like the following:

####<Mar 16, 2015 12:27:27 AM HKT> <Info> <WebLogicServer> <hklp141p.xxxx.com> <> <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1426436849796> <BEA-000000> <Initializing self-tuning thread pool> 
####<Mar 16, 2015 12:27:28 AM HKT> <Info> <Management> <hklp141p.xxxx.com> <> <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1426436850227> <BEA-000000> <WebLogic Server "WLS_DOM_CMN1" version:
WebLogic Server 10.3.6.0.8 PSU Patch for BUG18040640 THU MARCH 27 15:54:42 IST 2014
WebLogic Server 10.3.6.0  Tue Nov 15 08:52:36 PST 2011 1441050  Copyright (c) 1995, 2011, Oracle and/or its affiliates. All rights reserved.>

I can extract the first two lines with the Pig script below:

A = LOAD '/user/hdfs/csv/log/flume/*';
B = FOREACH A GENERATE REPLACE($0,',','');
C = FOREACH B GENERATE FLATTEN(REGEX_EXTRACT_ALL($0, '####<([^<>]+)> <([^<>]+)> <([^<>]+)> <([^<>]+)> <([^<>]*?)> <([^<>]+)> <<?([^<>]*?)>?> <([^<>]*?)> <([^<>]*?)> <([^<>]*?)> <([^<>]+)> <([^<>]+)>? ?'));
dump C;

The result is:

(Mar 16 2015 12:27:27 AM HKT,Info,WebLogicServer,hklp141p.xxxx.com,,[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)',,,,1426436847404,BEA-000000,Starting WebLogic Server with Oracle JRockit(R) Version R28.3.2-14-160877-1.6.0_75-20140321-2359-linux-x86_64 from Oracle Corporation)
(Mar 16 2015 12:27:28 AM HKT,Info,Management,hklp141p.xxxx.com,,[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)',,,,1426436848329,BEA-000000,Version: WebLogic Server 10.3.6.0.8 PSU Patch for BUG18040640 THU MARCH 27 15:54:42 IST 2014)
()
()

However, the last two lines belong to the same message as the second line, so the expected result is:

(Mar 16 2015 12:27:27 AM HKT,Info,WebLogicServer,hklp141p.xxxx.com,,[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)',,,,1426436847404,BEA-000000,Starting WebLogic Server with Oracle JRockit(R) Version R28.3.2-14-160877-1.6.0_75-20140321-2359-linux-x86_64 from Oracle Corporation)
(Mar 16 2015 12:27:28 AM HKT,Info,Management,hklp141p.xxxx.com,,[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)',,,,1426436848329,BEA-000000,Version: WebLogic Server 10.3.6.0.8 PSU Patch for BUG18040640 THU MARCH 27 15:54:42 IST 2014 WebLogic Server 10.3.6.0.8 PSU Patch for BUG18040640 THU MARCH 27 15:54:42 IST 2014 WebLogic Server 10.3.6.0  Tue Nov 15 08:52:36 PST 2011 1441050  Copyright (c) 1995, 2011, Oracle and/or its affiliates. All rights reserved.)

How can I get this result set from a Pig script?
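
One way to state the expected output as a rule: every record starts with the marker ####<, and any line without that marker continues the previous record. The sketch below illustrates that rule in plain Java, outside of Pig. The class name LogLineMerger, the method merge, and the choice of a single space as the joiner are assumptions made for illustration; none of them come from the script above.

```java
import java.util.ArrayList;
import java.util.List;

class LogLineMerger {
    // Assumption: every WebLogic record begins with this marker.
    static final String RECORD_PREFIX = "####<";

    // Joins continuation lines onto the record they belong to.
    public static List<String> merge(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (line.startsWith(RECORD_PREFIX)) {
                // A new record begins, so the previous one is complete.
                if (current != null) {
                    records.add(current.toString());
                }
                current = new StringBuilder(line);
            } else if (current != null && !line.isEmpty()) {
                // Continuation line: append it to the open record.
                current.append(' ').append(line);
            }
            // Blank lines and lines before the first marker are dropped.
        }
        if (current != null) {
            records.add(current.toString());
        }
        return records;
    }
}
```

Applied to the four sample log lines above, merge would return two records, with the two version banner lines appended to the second one. Each merged record could then be passed to the same field-extraction regex.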

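Update:

I am trying to write a UDF for the LOAD function, and I found that the line it returns comes from this call: Text value = (Text) recordReader.getCurrentValue();

However, I still cannot customize how lines are read. I am not sure which part of the code I should modify. Should the change go inside the prepareToRead function?

Here is the sample code: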

package com.weblogic.pig;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.pig.*;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.*;
import org.apache.pig.data.*;

import java.io.IOException;
import java.util.*;


public class MyLoader extends LoadFunc {
  protected RecordReader recordReader = null;

  @Override
  public void setLocation(String s, Job job) throws IOException {
    FileInputFormat.setInputPaths(job, s);
  }

  @Override
  public InputFormat getInputFormat() throws IOException {
    return new PigTextInputFormat();
  }

  @Override
  public void prepareToRead(RecordReader recordReader, PigSplit pigSplit) throws IOException {
    this.recordReader = recordReader;
  }

  @Override
  public Tuple getNext() throws IOException {
    try {
      boolean flag = recordReader.nextKeyValue();
      if (!flag) {
        return null;
      }
      Text value = (Text) recordReader.getCurrentValue();
      String[] strArray = value.toString().split(",");
      // Copy the comma-separated fields into a typed list for the tuple.
      List<Object> lst = new ArrayList<Object>(Arrays.asList(strArray));
      return TupleFactory.getInstance().newTuple(lst);
    } catch (InterruptedException e) {
      throw new ExecException("Read data error", PigException.REMOTE_ENVIRONMENT, e);
    }
  }
}
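
On the question of where the line reading could be customized: in the loader above, prepareToRead only stores the RecordReader, while getNext() is what produces each record, so the merging would most plausibly live in getNext(). The sketch below shows a one-line look-ahead buffer that could serve that purpose. To keep it self-contained, a plain Iterator<String> stands in for the Hadoop RecordReader; the class name MergingLineReader and the space joiner are hypothetical, and this is not a tested Pig loader.

```java
import java.util.Iterator;

class MergingLineReader {
    // Assumption: every WebLogic record begins with this marker.
    private static final String RECORD_PREFIX = "####<";

    private final Iterator<String> lines; // stand-in for the Hadoop RecordReader
    private String pending;               // look-ahead: first line of the next record

    public MergingLineReader(Iterator<String> lines) {
        this.lines = lines;
    }

    // Returns the next merged record, or null when the input is exhausted,
    // mirroring how getNext() in the loader signals end of input.
    public String next() {
        // Skip forward to the start of a record if none is buffered.
        while (pending == null && lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith(RECORD_PREFIX)) {
                pending = line;
            }
        }
        if (pending == null) {
            return null;
        }
        StringBuilder record = new StringBuilder(pending);
        pending = null;
        // Append continuation lines until the next record marker shows up.
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.startsWith(RECORD_PREFIX)) {
                pending = line; // keep it for the following call
                break;
            }
            if (!line.isEmpty()) {
                record.append(' ').append(line);
            }
        }
        return record.toString();
    }
}
```

In an actual loader, lines.hasNext() and lines.next() would correspond roughly to recordReader.nextKeyValue() and recordReader.getCurrentValue().toString(). One caveat under this approach: a record whose continuation lines straddle an input split boundary may not be reassembled correctly, because each split is read independently.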

Thank you very much!!

Best regards, Johnson

0 answers:

No answers