I am converting WebLogic logs into CSV format so that Hive can run further jobs on them, but I have run into a problem with the log below:
####<Mar 16, 2015 12:27:27 AM HKT> <Info> <WebLogicServer> <hklp141p.xxxx.com> <> <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1426436849796> <BEA-000000> <Initializing self-tuning thread pool>
####<Mar 16, 2015 12:27:28 AM HKT> <Info> <Management> <hklp141p.xxxx.com> <> <[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'> <> <> <> <1426436850227> <BEA-000000> <WebLogic Server "WLS_DOM_CMN1" version:
WebLogic Server 10.3.6.0.8 PSU Patch for BUG18040640 THU MARCH 27 15:54:42 IST 2014
WebLogic Server 10.3.6.0 Tue Nov 15 08:52:36 PST 2011 1441050 Copyright (c) 1995, 2011, Oracle and/or its affiliates. All rights reserved.>
I can extract the first two records with the Pig script below:
-- Load the Flume-collected raw log files, one line per record
A = LOAD '/user/hdfs/csv/log/flume/*';
-- Strip commas so they cannot clash with the CSV delimiter later
B = FOREACH A GENERATE REPLACE($0, ',', '');
-- Pull the individual <...> fields out of each ####<...> header line
C = FOREACH B GENERATE FLATTEN(REGEX_EXTRACT_ALL($0, '####<([^<>]+)> <([^<>]+)> <([^<>]+)> <([^<>]+)> <([^<>]*?)> <([^<>]+)> <<?([^<>]*?)>?> <([^<>]*?)> <([^<>]*?)> <([^<>]*?)> <([^<>]+)> <([^<>]+)>? ?'));
dump C;
The result is as follows:
(Mar 16 2015 12:27:27 AM HKT,Info,WebLogicServer,hklp141p.xxxx.com,,[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)',,,,1426436847404,BEA-000000,Starting WebLogic Server with Oracle JRockit(R) Version R28.3.2-14-160877-1.6.0_75-20140321-2359-linux-x86_64 from Oracle Corporation)
(Mar 16 2015 12:27:28 AM HKT,Info,Management,hklp141p.xxxx.com,,[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)',,,,1426436848329,BEA-000000,Version: WebLogic Server 10.3.6.0.8 PSU Patch for BUG18040640 THU MARCH 27 15:54:42 IST 2014)
()
()
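The two empty tuples make sense when I test the pattern directly: REGEX_EXTRACT_ALL requires a full match, and the continuation lines lack the ####< prefix, so the pattern returns nothing for them. A quick standalone check (plain Java; the class name and the trimmed pattern are just for illustration):

import java.util.regex.Pattern;

// Quick standalone check (class name is mine): the Pig pattern is anchored
// on "####<", so continuation lines can never produce a match.
public class HeaderCheck {
    public static void main(String[] args) {
        // First group of the Pig regex, trimmed here for brevity
        Pattern p = Pattern.compile("####<([^<>]+)>.*");
        String header = "####<Mar 16, 2015 12:27:27 AM HKT> <Info> ...";
        String continuation = "WebLogic Server 10.3.6.0 Tue Nov 15 08:52:36 PST 2011 ...";
        System.out.println(p.matcher(header).matches());        // true
        System.out.println(p.matcher(continuation).matches());  // false -> empty tuple from FLATTEN
    }
}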
However, the last two lines belong in the same message as the second record, so the expected result should look like this:
(Mar 16 2015 12:27:27 AM HKT,Info,WebLogicServer,hklp141p.xxxx.com,,[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)',,,,1426436847404,BEA-000000,Starting WebLogic Server with Oracle JRockit(R) Version R28.3.2-14-160877-1.6.0_75-20140321-2359-linux-x86_64 from Oracle Corporation)
(Mar 16 2015 12:27:28 AM HKT,Info,Management,hklp141p.xxxx.com,,[ACTIVE] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)',,,,1426436848329,BEA-000000,Version: WebLogic Server 10.3.6.0.8 PSU Patch for BUG18040640 THU MARCH 27 15:54:42 IST 2014 WebLogic Server 10.3.6.0.8 PSU Patch for BUG18040640 THU MARCH 27 15:54:42 IST 2014 WebLogic Server 10.3.6.0 Tue Nov 15 08:52:36 PST 2011 1441050 Copyright (c) 1995, 2011, Oracle and/or its affiliates. All rights reserved.)
How can I get that result set from the Pig script?
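To state the rule I am after precisely: any line that does not start with ####< belongs to the previous entry's message. In isolation, the folding looks like this (illustrative plain Java; class and method names are mine):

import java.util.ArrayList;
import java.util.List;

// Illustration only: fold WebLogic continuation lines into the preceding
// "####<...>" entry. Class and method names are made up for this example.
public class RecordMerger {
    public static List<String> merge(List<String> lines) {
        List<String> entries = new ArrayList<String>();
        StringBuilder current = null;
        for (String line : lines) {
            if (line.startsWith("####<")) {
                // A new entry begins; flush the one we were building
                if (current != null) {
                    entries.add(current.toString());
                }
                current = new StringBuilder(line);
            } else if (current != null) {
                // Continuation line: append it to the current entry's message
                current.append(' ').append(line);
            }
        }
        if (current != null) {
            entries.add(current.toString());
        }
        return entries;
    }
}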
Update:
I am trying to write a UDF for the LOAD function, and I found that the returned line depends on this call: Text value = (Text) recordReader.getCurrentValue();
However, I still cannot work out how to customize the line-reading logic. I am not sure which part of the code I should modify; should it go inside the prepareToRead function?
Here is the sample code:
package com.weblogic.pig;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.pig.*;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.*;
import org.apache.pig.data.*;

import java.io.IOException;
import java.util.*;

public class MyLoader extends LoadFunc {

    protected RecordReader recordReader = null;

    @Override
    public void setLocation(String s, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, s);
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        // Pig's stock text input format: hands back one line per record
        return new PigTextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader recordReader, PigSplit pigSplit) throws IOException {
        this.recordReader = recordReader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            // Advance to the next line; returning null tells Pig we are done
            if (!recordReader.nextKeyValue()) {
                return null;
            }
            Text value = (Text) recordReader.getCurrentValue();
            // Split the line on commas and emit one field per tuple slot
            String[] strArray = value.toString().split(",");
            List<String> lst = new ArrayList<String>(Arrays.asList(strArray));
            return TupleFactory.getInstance().newTuple(lst);
        } catch (InterruptedException e) {
            throw new ExecException("Read data error", PigException.REMOTE_ENVIRONMENT, e);
        }
    }
}
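One direction I am considering, to make the question concrete: since prepareToRead only hands over the RecordReader, the merging would have to happen inside getNext, by accumulating lines until the next ####< header shows up and carrying one line of lookahead between calls. A minimal sketch of a drop-in replacement for the getNext above, under that assumption (the pendingLine field is my own addition, not part of the LoadFunc API):

    // Sketch only: a getNext() that folds continuation lines into the current
    // record. Assumes every record starts with "####<"; pendingLine holds the
    // header line read ahead on the previous call.
    private String pendingLine = null;

    @Override
    public Tuple getNext() throws IOException {
        try {
            StringBuilder record = null;
            if (pendingLine != null) {
                // Start from the header we read past on the previous call
                record = new StringBuilder(pendingLine);
                pendingLine = null;
            }
            while (recordReader.nextKeyValue()) {
                String line = recordReader.getCurrentValue().toString();
                if (line.startsWith("####<")) {
                    if (record == null) {
                        record = new StringBuilder(line); // first record in this split
                    } else {
                        pendingLine = line;               // next record: save it and stop
                        break;
                    }
                } else if (record != null) {
                    record.append(' ').append(line);      // continuation: extend the message
                }
            }
            if (record == null) {
                return null; // end of input
            }
            List<String> fields = new ArrayList<String>(Arrays.asList(record.toString().split(",")));
            return TupleFactory.getInstance().newTuple(fields);
        } catch (InterruptedException e) {
            throw new ExecException("Read data error", PigException.REMOTE_ENVIRONMENT, e);
        }
    }

With that, one whole ####<...> entry (including its continuation lines) would come back as a single tuple, so the REGEX_EXTRACT_ALL step above could run over the merged message. I am not sure this is the intended place to hook in, hence the question.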
Many thanks!!
Best regards, Johnson