Pig Latin: filtering numbers < 5 and >= 5 in a chararray (text and numbers)

Time: 2019-03-28 12:49:21

Tags: apache-pig

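How can I filter or group the rows into those with less than 5 years of experience and those with 5 or more? I am very new to Pig Latin. The IDs, e.g. BUS2003, should stay unchanged.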

Input data

ID,Experience
BUS2003,More than 17 years teaching experience
BUS1303,2 years teaching experience
BUS4543,13 plus years of teaching experience; 4 plus years of corporate experience
BUS2103,4 year + 6 years in business
BUS2913,8 yrs teaching experience

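I know how to load the data with PigStorage or CSVLoader, but because the words and numbers are mixed together, I am having trouble working with the Experience field.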

Desired result:

**Less than five years**
BUS1303,2 years teaching experience
BUS2103,4 year + 6 years in business

**Equal or greater than five years**
BUS2003,More than 17 years teaching experience
BUS4543,13 plus years of teaching experience; 4 plus years of corporate experience
BUS2913,8 yrs teaching experience

Thanks in advance.

1 answer:

Answer 0: (score: 1)

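You have to extract the number first and then split on it. Only the first number in the Experience text is used, which matches the desired result (for example, "4 year + 6 years in business" counts as 4). This should give you what you need: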

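-- Load each line as an ID and a free-text experience description.
A = LOAD 'input.txt' USING PigStorage(',') AS (a1:chararray, a2:chararray);
-- \\d+ (one or more digits) picks up the first number anywhere in the text; cast it to int.
B = FOREACH A GENERATE a1, a2, (int)REGEX_EXTRACT(a2, '(\\d+)', 1) AS exp;
-- SPLIT is a statement of its own; rows with a null exp (such as a header line) go to neither output.
SPLIT B INTO C1 IF exp < 5, C2 IF exp >= 5;
DUMP C1;
DUMP C2;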
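
To sanity-check the extraction rule outside Pig, here is a minimal Python sketch of the same idea: take the first run of digits in the text and compare it with 5. It is only an illustration, not part of the Pig script. The names `first_number` and `split_by_experience` are made up for this example, and it assumes that Python's `re` and the Java regex engine behind `REGEX_EXTRACT` treat `\d+` the same way on this ASCII sample data.

```python
import re

# Sample rows copied from the question (header line omitted).
ROWS = [
    ("BUS2003", "More than 17 years teaching experience"),
    ("BUS1303", "2 years teaching experience"),
    ("BUS4543", "13 plus years of teaching experience; 4 plus years of corporate experience"),
    ("BUS2103", "4 year + 6 years in business"),
    ("BUS2913", "8 yrs teaching experience"),
]

FIRST_NUMBER = re.compile(r"\d+")


def first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = FIRST_NUMBER.search(text)
    return int(match.group()) if match else None


def split_by_experience(rows, threshold=5):
    """Partition rows into (below, at_or_above) using the first number in the text."""
    below, at_or_above = [], []
    for row_id, text in rows:
        years = first_number(text)
        if years is None:
            continue  # no number found: the row belongs to neither group
        (below if years < threshold else at_or_above).append((row_id, text))
    return below, at_or_above


less_than_five, five_or_more = split_by_experience(ROWS)
print([row_id for row_id, _ in less_than_five])
print([row_id for row_id, _ in five_or_more])
```

Because only the first number is considered, BUS2103 ("4 year + 6 years in business") lands in the under-five group and BUS4543 ("13 plus years ...; 4 plus years ...") in the other, as in the desired result. If the intent were to sum or take the largest number instead, the rule would need to change in both the sketch and the Pig script.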