PIG REGEX_EXTRACT_ALL not working

Time: 2015-10-27 10:51:36

Tags: regex apache-pig

I have the following data.

•   PRT_Edit & Set Shopping Cart in Retail

•   PRT_Confirm Shopping Cart for Goods

o   PRT-Ret_Process Supplier Invoice

o   PRT-Web_Overview of Orders

o   PRT_Update Outfirst Agreement

PRT_Axn_-Purchase and Requisition

The data contains special symbols, tabs, and spaces. I want to extract only the text part from this data:

PRT_Edit & Set Shopping Cart in Retail

PRT_Confirm Shopping Cart for Goods

PRT-Ret_Process Supplier Invoice

PRT-Web_Overview of Orders

PRT_Update Outfirst Agreement

I tried using REGEX_EXTRACT_ALL in a Pig script as shown below, but it does not work.

PRT = LOAD '/DATA' USING TEXTLOADER() AS (LINE:CHARARRAY);

Cleansed = FOREACH PRT GENERATE REGEX_EXTRACT_ALL(LINE,'[A-Z]*') AS DATA;

When I try to DUMP Cleansed, it does not show any data. Could anyone please help?

1 answer:

Answer 0: (score: 1)

You can use

Cleansed = FOREACH PRT GENERATE FLATTEN(
      REGEX_EXTRACT_ALL(LINE, '^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$'))
       AS (FIELD1:chararray), LINE;

The regex matches the following:

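  • ^ - start of string
  • [^a-zA-Z]* - 0 or more characters other than Latin letters (a negated character class)
  • ([a-zA-Z].*[a-zA-Z]) - the capture group that we later refer to as FIELD1, matching:
    • [a-zA-Z].*[a-zA-Z] - a Latin letter, then any characters, as many as possible (with the greedy *, not the lazy *?), and then a Latin letter
  • [^a-zA-Z]* - 0 or more characters other than Latin letters
  • $ - end of string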
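
To check the pattern outside Pig, here is a minimal Python sketch. It assumes that Python's re.fullmatch behaves like REGEX_EXTRACT_ALL in requiring the whole string to match. Pig uses Java regular expressions, and this pattern uses no syntax that differs between the two. The function name extract_text and the sample lines are made up for illustration.

```python
import re

# The pattern from the answer, copied verbatim.
PATTERN = re.compile(r'^[^a-zA-Z]*([a-zA-Z].*[a-zA-Z])[^a-zA-Z]*$')


def extract_text(line):
    """Return the text captured by group 1, or None if the line does not match."""
    m = PATTERN.fullmatch(line)  # assumption: stands in for Pig's whole-string match
    return m.group(1) if m else None


samples = [
    "•   PRT_Edit & Set Shopping Cart in Retail",   # bullet glyph plus spaces
    "\t• PRT_Confirm Shopping Cart for Goods  ",    # leading tab, trailing spaces
    "PRT_Axn_-Purchase and Requisition",            # nothing to strip
    "o   PRT-Web_Overview of Orders",               # bullet typed as the letter o
]

for s in samples:
    print(repr(extract_text(s)))

# The pattern from the question cannot cover a whole line, so there is no match.
print(re.fullmatch(r'[A-Z]*', samples[0]))  # None
```

Under that assumption, this also suggests why the pattern in the question returns nothing: '[A-Z]*' cannot match an entire line, and it has no capture group to return. Note that a bullet typed as the literal letter o is itself a Latin letter, so the pattern keeps it. Such lines would need extra handling.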