In the mapping file

Date: 2016-05-15 11:44:12

Tags: regex replace apache-nifi

In my particular case, I need some clarification on how to use ReplaceTextWithMapping in NiFi. My input file looks like this:

{"field1" : "A",
"field2" : "A",
"field3": "A"
}

The mapping file looks like this:

 Header1;Header2;Header3
 A;some text;2

My expected result is:

   {"field1" : "some text",
    "field2": "A",
    "field3": "A2"
    }

The regular expression is set to:

[A-Z0-9]+

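It matches the field key in the mapping file (we expect uppercase letters, or uppercase letters plus digits), but beyond that I am not sure how you decide which value (from column 2 or column 3) gets assigned to the input value. Also, my field2 should not change: it needs to keep the value it had in the input, with no mapping applied. At the moment, I get something like this: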

  {"field1" : "some text A2",
    "field2": "some text A2",
    "field3": "some text A2"
    }

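I guess my main question is: can you map the same value in the input file to different values taken from different columns of the mapping file?

Thanks

Edit: I am using ReplaceTextWithMapping, an out-of-the-box processor in Apache NiFi (v.5.1). In my data flow I end up with a JSON file to which I need to apply some mappings from an external file, which I would like to load in memory (rather than parse with ExtractText).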

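2 Answers:

Answer 0: (score: 0)

Foreword

It looks like you are working with a JSON string. It would be easier to process such a string with a JSON parsing engine, because the JSON structure allows difficult edge cases that make parsing with regular expressions hard. That said, I am sure you have your reasons, and I am not the regex police.

Description

To do a replacement like this, it is easier to capture both the substrings you want to keep and the substrings you want to replace.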

(\{"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+\})

Replace with: $1SomeText$3$4$5A2$7

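Note: I recommend using the following flags with this expression: Case Insensitive, and Dot matches all characters including newlines.

Examples

Live Demo

This example shows how the regular expression matches the source text: https://regex101.com/r/vM1qE2/1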

Source text

{"field1" : "A",
"field2" : "A",
"field3": "A"
}

After replacement

{"field1" : "SomeText",
"field2" : "A",
"field3": "A2"
}
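
For anyone who wants to try the replacement outside regex101, here is a minimal sketch in Python. It uses the same expression and the same source text as above. The only differences are Python-specific: back-references are written as \g<n> instead of $n, and the two recommended flags become re.IGNORECASE and re.DOTALL. The names PATTERN, source and result are just illustrative.

```python
import re

# The expression from the answer, split over several lines for readability.
PATTERN = re.compile(
    r'(\{"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+\})',
    re.IGNORECASE | re.DOTALL,
)

source = '{"field1" : "A",\n"field2" : "A",\n"field3": "A"\n}'

# Groups 1, 3, 5 and 7 are the parts to keep; group 4 is the untouched
# field2 value. Groups 2 and 6 are replaced by fixed text.
result = PATTERN.sub(r'\g<1>SomeText\g<3>\g<4>\g<5>A2\g<7>', source)
print(result)
```

Note that the replacement text is hard-coded here, exactly as in the answer; it does not read anything from a mapping file.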

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \{                       '{'
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    :                        ':'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [,\r\n]+                 any character of: ',', '\r' (carriage
                             return), '\n' (newline) (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    :                        ':'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  (                        group and capture to \5:
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [,\r\n]+                 any character of: ',', '\r' (carriage
                             return), '\n' (newline) (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    :                        ':'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
  )                        end of \5
----------------------------------------------------------------------
  (                        group and capture to \6:
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  )                        end of \6
----------------------------------------------------------------------
  (                        group and capture to \7:
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [,\r\n]+                 any character of: ',', '\r' (carriage
                             return), '\n' (newline) (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
    \}                       '}'
----------------------------------------------------------------------
  )                        end of \7

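Answer 1: (score: 0)

So I dug into ReplaceTextWithMapping to try to get it to work for your use case, but I don't think it is powerful enough to do what you want. At the moment it is designed for pretty much one purpose: matching a simple regular expression and mapping one group of non-whitespace characters to another group of characters (which may contain whitespace and back-references).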

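Looked at as plain text, your use case changes the value of one capture group based on the value of another capture group and the mapping file. Looked at as JSON, your use case is much simpler: you want to change the value of a key/value pair based on the key and the mapping file. As a side note, if you did not need the mapping file, I believe a new JSON-to-JSON processor is coming in 0.7.0 [1].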
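
To illustrate the JSON view of the problem, here is a rough sketch in plain Python (not NiFi code). MAPPING, FIELD_RULES and apply_mapping are hypothetical names made up for this example. The rule for field3 (append the third column to the input value) is only a guess based on the expected output "A2" in the question.

```python
import json

# Hypothetical in-memory form of the mapping file:
# first column -> (second column, third column).
MAPPING = {"A": ("some text", "2")}

# Hypothetical per-field rules; fields not listed here are left untouched.
FIELD_RULES = {
    "field1": lambda value, row: row[0],          # replace with column 2
    "field3": lambda value, row: value + row[1],  # append column 3 (assumed)
}


def apply_mapping(record, mapping=MAPPING, rules=FIELD_RULES):
    """Return a copy of record with the per-field rules applied."""
    out = {}
    for key, value in record.items():
        rule = rules.get(key)
        row = mapping.get(value)
        # Only rewrite when the field has a rule and the value is mapped.
        out[key] = rule(value, row) if rule and row else value
    return out


record = json.loads('{"field1": "A", "field2": "A", "field3": "A"}')
print(json.dumps(apply_mapping(record)))
```

Because the rule is chosen by key, the same input value "A" can end up as three different outputs, which is what a value-only lookup cannot express.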

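As for finding a solution, both ways of looking at the problem are valid. ReplaceTextWithMapping could certainly be extended to support more advanced use cases, but that might make it too complicated (although, since its scope is not clearly defined, it is arguably already more confusing than it should be). A new processor along the lines of "ReplaceJsonWithMapping" could also be added, but its scope and purpose would need to be clearly defined.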

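Also, for a more immediate solution, there is always the option of using the ExecuteScript processor. Here [2] is a link to a blog post (written by the creator of ExecuteScript) that outlines how to write a basic JSON-to-JSON processor. More logic would need to be added to read the mapping file.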
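
As a starting point for that extra logic, here is a small sketch of reading a semicolon-delimited mapping file like the one in the question. It is plain Python with no NiFi API calls; load_mapping is a hypothetical helper name, and how the file contents reach the script (a path, a processor property, and so on) is left open.

```python
import csv
import io


def load_mapping(text, delimiter=";"):
    """Return {first column: [remaining columns]} for each data row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(reader, None)  # skip the header row
    mapping = {}
    for row in reader:
        # Strip stray whitespace around each cell.
        cells = [cell.strip() for cell in row]
        if cells and cells[0]:  # ignore blank lines
            mapping[cells[0]] = cells[1:]
    return mapping


mapping_text = "Header1;Header2;Header3\nA;some text;2\n"
mapping = load_mapping(mapping_text)
print(mapping)
```

The resulting dictionary can be built once and kept in memory, then consulted for each JSON value as it is processed.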

[1] https://issues.apache.org/jira/browse/NIFI-361
[2] http://funnifi.blogspot.com/2016/02/executescript-json-to-json-conversion.html