Extracting a value from a line in Pig

Time: 2014-08-28 20:50:31

Tags: hadoop apache-pig

I am trying to group my data by URL. Each record is currently stored as one long line, for example: {"mobile","country:US","url:1234.com","newuser:y"} and so on.

Here is what I have so far:

RAW = LOAD '/data/events/raw/2014-08-21/' as (line:chararray);
A = FILTER RAW BY (INDEXOF(line,'mobile') != -1);
B = LIMIT A 800;
URL = GROUP B BY (INDEXOF(line, 'url'));
STORE URL INTO '/user/hadoopuser/RS_traffic.txt';

How can I extract the URL from the string so that I can group by it? Can I use a regular expression?

1 answer:

Answer 0: (score: 0)

You can use the REGEX_EXTRACT() function:

REGEX_EXTRACT Javadoc

RAW = LOAD '/data/events/*' AS (line:chararray);
-- Keep only the lines that mention 'mobile'.
A = FILTER RAW BY (INDEXOF(line,'mobile') != -1);
-- Extract capture group 1 of the pattern from each remaining line as the url field.
C = FOREACH A GENERATE REGEX_EXTRACT(line, '<your_pattern>', 1) AS url:chararray;
URL = GROUP C BY url;
....
STORE URL INTO '/user/hadoopuser/RS_traffic.txt';
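
For the sample record in the question, one possible pattern is sketched below. It assumes the URL always follows the literal text url: and ends at the next double quote, comma, or closing brace; if the real data uses different delimiters, the character class needs adjusting. The aliases MOBILE, URLS, WITH_URL, BY_URL and COUNTS are illustrative names, and the per-URL count is only one example of what could be done with the groups.

```pig
-- Sketch only: assumes lines look like {"mobile","country:US","url:1234.com","newuser:y"}
RAW = LOAD '/data/events/raw/2014-08-21/' AS (line:chararray);
MOBILE = FILTER RAW BY INDEXOF(line, 'mobile') != -1;

-- Capture everything after 'url:' up to the next double quote, comma, or closing brace.
URLS = FOREACH MOBILE GENERATE REGEX_EXTRACT(line, 'url:([^",}]+)', 1) AS url:chararray;

-- REGEX_EXTRACT yields null when the pattern does not match, so drop those rows first.
WITH_URL = FILTER URLS BY url IS NOT NULL;

-- Group by the extracted URL and count the records in each group.
BY_URL = GROUP WITH_URL BY url;
COUNTS = FOREACH BY_URL GENERATE group AS url, COUNT(WITH_URL) AS hits;

STORE COUNTS INTO '/user/hadoopuser/RS_traffic.txt';
```

Filtering out null URLs before the GROUP keeps all unmatched lines from collapsing into a single null group. If the pattern ever needs a backslash (for example \d), it has to be doubled inside the Pig string literal.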