File content (test.txt):
Some specific column value: x192.168.1.2 blah blah
Some specific row value: y192.168.1.3 blah blah
Some specific field value: z192.168.1.4 blah blah
Pig query:
A = LOAD 'test.txt' USING PigStorage('\t') AS (data1: chararray , data2: chararray , data3: chararray, data4: chararray , data5: chararray , data6: chararray);
B = foreach A generate data3, data4;
C = filter B by data3 matches 'row';
D = foreach C generate data4;
E = foreach D generate TOKENIZE(data4);
Output:
((value:), (y192.168.1.3))
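Now I want to extract a specific tuple from this output bag, for example the second tuple (y192.168.1.3). After that I want to extract the IP address. I am trying to use a UDF, but I am stuck.
Answer 0 (score: 3)
You can use the FLATTEN operator to flatten the bag, then use a filter to keep only the token that contains the IP address.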
E = foreach C generate flatten(TOKENIZE(data4));
F = filter E by $0 matches '.\\d+\\.\\d+\\.\\d+\\.\\d+';
Hope this helps.
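Pig's matches operator follows Java regex semantics and has to match the whole string, so the filter above keeps the token y192.168.1.3 with its leading character still attached. A minimal Java sketch of that behaviour follows; the class and method names (MatchesDemo, keeps) are made up for illustration and are not part of Pig.

```java
// Hypothetical demo class, not part of Pig: mirrors the filter's pattern in plain Java.
class MatchesDemo {
    // Same pattern as in the filter: any one character, then four dot-separated digit groups.
    static final String TOKEN_PATTERN = ".\\d+\\.\\d+\\.\\d+\\.\\d+";

    // String.matches succeeds only if the entire token matches the pattern.
    static boolean keeps(String token) {
        return token.matches(TOKEN_PATTERN);
    }

    public static void main(String[] args) {
        System.out.println(keeps("value:"));              // prints false
        System.out.println(keeps("y192.168.1.3"));        // prints true
        // The kept token still carries the prefix character; dropping it leaves the address.
        System.out.println("y192.168.1.3".substring(1));  // prints 192.168.1.3
    }
}
```

So after the filter you would still need one more step (for example a REGEX_EXTRACT, as in the next answer) if you want the bare address.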
Answer 1 (score: 3)
This is what I would do.
Pig script
A = LOAD 'test.txt' USING PigStorage('\t') AS (data1: chararray , data2: chararray , data3: chararray, data4: chararray , data5: chararray , data6: chararray);
B = foreach A generate data3, data4;
C = filter B by data3 matches 'row';
D = foreach C generate data4;
E = foreach D generate REGEX_EXTRACT($0,'value: .([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+).*', 1);
Output
(192.168.1.3)
If you need to, you can use a more elaborate regex to extract the IP address: Extract ip addresses from Strings using regex
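As a rough idea of what a stricter pattern could look like, here is a small Java sketch that only accepts octets in the range 0-255. It is an illustration, not the pattern from the linked question; the names Ipv4Finder and firstIpv4 are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: finds the first plausible IPv4 address in a string.
class Ipv4Finder {
    // One octet in the range 0-255.
    private static final String OCTET = "(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])";
    // Four dot-separated octets that are not glued to a neighbouring digit.
    private static final Pattern IPV4 =
        Pattern.compile("(?<![0-9])" + OCTET + "(?:\\." + OCTET + "){3}(?![0-9])");

    static Optional<String> firstIpv4(String text) {
        if (text == null) {
            return Optional.empty();
        }
        Matcher m = IPV4.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(firstIpv4("value: y192.168.1.3").orElse("none")); // prints 192.168.1.3
        System.out.println(firstIpv4("value: y999.168.1.3").orElse("none")); // prints none
    }
}
```

Since Pig regex functions use Java regex syntax, the same pattern text should be usable inside REGEX_EXTRACT, with the backslashes escaped as in the script above.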
Answer 2 (score: 1)
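import java.io.IOException;
import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class someClass extends EvalFunc<String>
{
    // Same loose IP pattern as the REGEX_EXTRACT answer; swap in a stricter one if needed.
    private static final Pattern IP = Pattern.compile("[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+");

    public String exec(Tuple input) throws IOException {
        DataBag bag = (DataBag) input.get(0);
        Iterator<Tuple> it = bag.iterator();
        Tuple tup = null;
        // Advance to the second tuple in the bag.
        for (int i = 0; i < 2; i++)
        {
            tup = it.next();
        }
        String ipString = (String) tup.get(0);
        // Get the IP from the string with a regex; null if there is none.
        Matcher m = IP.matcher(ipString);
        return m.find() ? m.group() : null;
    }
}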
Of course you should add some input checks (null input, a bag of size 1, etc.) to make the code safe.
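The checks themselves could look roughly like the sketch below. It is written in plain Java so it runs without Pig on the classpath: a List<String> stands in for the bag's tokens, and SafeTokenPicker and tokenAt are hypothetical names. Returning null for bad input is an assumption here, chosen because a Pig UDF can return null to signal a missing value.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the UDF's guard logic; a List<String> plays the role of the bag.
class SafeTokenPicker {
    // Returns the token at the given index, or null when the list is null or too short.
    static String tokenAt(List<String> tokens, int index) {
        if (tokens == null || index < 0 || index >= tokens.size()) {
            return null;
        }
        return tokens.get(index);
    }

    public static void main(String[] args) {
        System.out.println(tokenAt(Arrays.asList("value:", "y192.168.1.3"), 1)); // prints y192.168.1.3
        System.out.println(tokenAt(Arrays.asList("value:"), 1));                 // prints null
        System.out.println(tokenAt(null, 1));                                    // prints null
    }
}
```

In the real UDF the same guards would go before the loop: check that input and input.get(0) are not null, and call it.hasNext() before each it.next().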