Pig Latin - Extract fields matching two different filter criteria from chararray lines and group them in one bag

Time: 2014-09-15 16:09:04

Tags: apache-pig

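I am new to Pig Latin. I want to extract all lines from a log file that match a filter criterion (lines containing the word "line_token"), and then extract from those matching lines two different fields, each satisfying its own field-matching criterion. Because the lines are not well structured, I load them as chararray. When I run the following code, I get the error "Invalid resource schema: bag schema must have tuple as its field". I tried an explicit cast to tuple, but that did not work.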

input_lines = LOAD '/inputdir/' AS ( line:chararray);

filtered_lines = FILTER input_lines BY (line MATCHES  '.*line_token1.*' );

tokenized_lines = FOREACH filtered_lines GENERATE FLATTEN(TOKENIZE(line)) AS tok_line;

my_wordbag = FOREACH tokenized_lines {
     word1 = FILTER tok_line BY ( $0 MATCHES  '.*word_token1.*'  ) ;
     word2 = FILTER tok_line BY ( $0 MATCHES  '.*word_token1.*' ) ;
     GENERATE word1 , word2 as my_tuple ;
  -- I also tried --> GENERATE (word1 , word2) as my_tuple ;
    }

dump my_wordbag;

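I think I am taking a very wrong approach. Please note that my logs are not well structured, so I cannot change the way I load them. Loading and the initial filtering for lines of interest is the easy part. I think I need to do something other than tokenizing each line and iterating over the tokens to find the fields. Or maybe I should use a join?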

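Also, if I knew the line structure in advance, with all fields being text, would loading the data differently (rather than as a chararray) make this an easier problem?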

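For now I have made a compromise: I added an extra filter clause to my original line filter and pick only one field from each line. When I get back to it, I will try using a join and post that code. Here is my working code, which gives me useful output, but not everything I want.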

-- read input lines from poorly structured log
input_lines = LOAD '/log-in-dir-in-hdfs' AS ( line:chararray) ;

-- Filter for line filter criteria and date interested in passed as arg
filtered_lines = FILTER input_lines BY (
       ( line MATCHES  '.*line_filter1.*' )
       AND ( line MATCHES '.*line_filter2.*' )
       AND ( line MATCHES '.*$forDate.*' )
       ) ;

-- Tokenize every line
tok_lines = FOREACH filtered_lines
        GENERATE TOKENIZE(line) AS tok_line;

-- Pick up a specific field from the tokenized line based on column filter criteria
fnames =   FOREACH tok_lines  {
        fname = FILTER tok_line BY ( $0 MATCHES  '.*field_selection.*' ) ;
        GENERATE FLATTEN(fname) as nnfname;
        }
-- Count occurrences of that field and store the count with the field name
-- My original intent is to store another field name as well
-- I will do that once I figure out how to put both of them in a tuple
flgroup    = FOREACH fnames
         GENERATE FLATTEN(TOKENIZE((chararray)$0)) as cfname;
grpfnames  = group flgroup by cfname;
readcounts = FOREACH grpfnames GENERATE COUNT(flgroup), group ;
STORE readcounts INTO '/out-dir-in-hdfs';

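2 answers:

Answer 0: (score: 2)

As far as I understand, after the FLATTEN operation each row holds a single value (tok_line), and you want to extract 2 words from each one. REGEX_EXTRACT will help you achieve that. I am not a regex expert, so I will leave writing the regex part to you.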

-- REGEX_EXTRACT takes (string, regex, index); index 1 returns the first capture group
data = FOREACH tokenized_lines 
          GENERATE 
              REGEX_EXTRACT(tok_line, <first word regex goes here>, 1) as firstWord,
              REGEX_EXTRACT(tok_line, <second word regex goes here>, 1) as secondWord;
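
As a rough sketch of how this could look end to end, the script below applies REGEX_EXTRACT to the whole line rather than to individual tokens, so that both fields land in the same tuple and can be grouped together. The patterns, the second token name word_token2, and the aliases extracted, pairs, grouped and pair_counts are placeholders invented for illustration. It also assumes the fields of interest are delimited by spaces, so adjust the regexes to the actual log format.

```pig
-- Sketch only: patterns and alias names are hypothetical placeholders.
input_lines = LOAD '/inputdir/' AS (line:chararray);

filtered_lines = FILTER input_lines BY line MATCHES '.*line_token1.*';

-- Capture the whole space-delimited token that contains each marker.
-- REGEX_EXTRACT yields null when the pattern does not match.
extracted = FOREACH filtered_lines GENERATE
    REGEX_EXTRACT(line, '.*?([^ ]*word_token1[^ ]*).*', 1) AS first_word,
    REGEX_EXTRACT(line, '.*?([^ ]*word_token2[^ ]*).*', 1) AS second_word;

-- Keep only the lines where both fields were found.
pairs = FILTER extracted BY first_word IS NOT NULL AND second_word IS NOT NULL;

-- Count how often each (first_word, second_word) combination occurs.
grouped = GROUP pairs BY (first_word, second_word);
pair_counts = FOREACH grouped GENERATE
    FLATTEN(group) AS (first_word, second_word),
    COUNT(pairs) AS pair_count;

DUMP pair_counts;
```

Working on the full line avoids having to re-associate tokens that FLATTEN has already spread across separate rows.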

I hope this helps.

Answer 1: (score: 0)

You have to reference the alias, not the column.

So:

word1 = FILTER tokenized_lines BY ( $0 MATCHES  '.*word_token1.*'  ) ;

word1 and word2 will then also be aliases, not columns.

What do you want the output to look like?
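
If the goal is to keep the nested FOREACH from the question, one possible variant is sketched below. It assumes the error comes from FLATTEN turning tok_line into a plain chararray, which a nested FILTER cannot iterate over; leaving the TOKENIZE result as a bag, as the asker's working script already does, lets both filters run inside the block. The pattern word_token2 and the output names words1 and words2 are hypothetical.

```pig
-- Sketch only: 'word_token2', words1 and words2 are hypothetical names.
input_lines = LOAD '/inputdir/' AS (line:chararray);

filtered_lines = FILTER input_lines BY line MATCHES '.*line_token1.*';

-- No FLATTEN here: tok_line stays a bag of single-field tuples.
tokenized_lines = FOREACH filtered_lines GENERATE TOKENIZE(line) AS tok_line;

-- Each nested FILTER produces a bag, so every output row holds two bags.
my_wordbag = FOREACH tokenized_lines {
    word1 = FILTER tok_line BY $0 MATCHES '.*word_token1.*';
    word2 = FILTER tok_line BY $0 MATCHES '.*word_token2.*';
    GENERATE word1 AS words1, word2 AS words2;
}

DUMP my_wordbag;
```

Each output row then carries two bags, one per filter, and either bag may be empty when a line has no matching token.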