Question

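I am trying to split a string using a regular expression. I need to use a regex in NiFi to split the string into groups. Can anyone help me split the string below with a regex?

Alternatively, how can we tell the split which occurrence of the delimiter to use? For example, in the string below, how do I specify that I want the string that follows the 3rd occurrence of a space?

Suppose I have the string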

"6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"

The result I want looks like this:

group 1 - 6/19/2017 12:14:07 PM
group 2 - 0FA0
group 3 - PACKET 0000000DF5EC3D80
group 4 - UDP
group 5 - Snd
group 6 - 11.222.333.44
group 7 - 93c8
group 8 - R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)

Can anyone help me? Thanks in advance.

Answer 1

If you only want certain spaces to act as delimiters, you can do something like this to avoid a fixed-width nightmare:

regex = "(\S+\s\S+\s\S+)\s(\S+)\s(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(.*)"

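It is pretty much what it looks like: runs of NON-whitespace characters (\S+) separated by single whitespace characters (\s), each wrapped in parentheses to form a group. The .* at the end is just the rest of the line and can be adjusted as needed. If you wanted every group to be a single non-space token, you could do a split instead of a regex, but that does not look like the case here. I don't have access to NiFi to test this, but here is an example in Python.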

import re

text = "6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)"
# Raw string, so the backslashes reach the regex engine unchanged.
regex = r"(\S+\s\S+\s\S+)\s(\S+)\s(\S+\s\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(.*)"

match = re.search(regex, text)
# Print the eight capture groups in order.
for i in range(1, 9):
    print("group %d - %s" % (i, match.group(i)))

Output:

group 1 - 6/19/2017 12:14:07 PM
group 2 - 0FA0
group 3 - PACKET 0000000DF5EC3D80
group 4 - UDP
group 5 - Snd
group 6 - 11.222.333.44
group 7 - 93c8
group 8 - R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)
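
On the second part of the question (taking the text that follows the n-th space): in Python, str.split accepts a maxsplit argument, so splitting on at most three spaces leaves everything after the third space in the last element. A minimal sketch follows; the helper name after_nth_space is made up for illustration, and this is plain Python rather than a NiFi feature.

```python
def after_nth_space(text, n):
    """Return the part of text that follows the n-th space.

    Hypothetical helper for illustration; raises ValueError when text
    contains fewer than n spaces.
    """
    parts = text.split(" ", n)  # at most n splits, so at most n + 1 parts
    if len(parts) <= n:
        raise ValueError("text has fewer than %d spaces" % n)
    return parts[n]


line = "6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd"
print(after_nth_space(line, 3))  # 0FA0 PACKET 0000000DF5EC3D80 UDP Snd
```

The same idea works from the other side: parts[:n] holds the n tokens that come before the n-th space.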

Answer 2

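Are you trying to extract each group into a separate attribute? This is certainly possible in "pure" NiFi, but with lines this long it may make more sense to use an ExecuteScript processor, combining the more sophisticated regex handling of Groovy or Python with String#split(), and supply a script like the one sniperd posted.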
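
Because every field in the sample line is separated by a single space, the split-based variant can be sketched in plain Python. The function name split_into_groups, the "group N" keys, and the token positions are assumptions read off the sample line in the question. The ExecuteScript wiring (reading the flowfile content, writing attributes) is deliberately left out, so this shows only the string handling such a script might contain.

```python
def split_into_groups(line):
    """Split a line on single spaces and regroup the tokens.

    Hypothetical helper: the token positions below are taken from the
    sample line in the question and are an assumption about the format.
    """
    tokens = line.split(" ")
    if len(tokens) < 11:
        raise ValueError("expected at least 11 space-separated tokens")
    values = [
        " ".join(tokens[0:3]),   # tokens 1-3 (date, time, AM/PM)
        tokens[3],               # token 4
        " ".join(tokens[4:6]),   # tokens 5-6 ("PACKET" plus its hex id)
        tokens[6],               # token 7
        tokens[7],               # token 8
        tokens[8],               # token 9
        tokens[9],               # token 10
        " ".join(tokens[10:]),   # everything else
    ]
    return {"group %d" % (i + 1): value for i, value in enumerate(values)}


sample = ("6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd "
          "11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR "
          "(2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)")
for key, value in split_into_groups(sample).items():
    print(key, "-", value)
```

Splitting is simpler than a regex here, but it relies on the fields never containing an unexpected extra space.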

To perform this task with ExtractText, you would configure it as follows:

Copyable patterns:

group 1: (^\S+\s\S+\s\S+)
group 2: (?i)(?<=\s)([a-f0-9]{4})(?=\s)
group 3: (?i)(?<=\s)(PACKET\s[a-f0-9]{4,16})(?=\s)
group 4: (?i)(?<=\s\S{16}\s)([\w]{3,})(?=\s)
group 5: (?i)(?<=\s.{3}\s)([\w]{3,})(?=\s)
group 6: (?i)(?<=\s.{3}\s)([\d\.]{7,15})(?=\s)
group 7: (?i)(?<=\d\s)([a-f0-9]{4})(?=\s)
group 8: (?i)(?<=\d\s[a-f0-9]{4}\s)(.*)$
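
For a quick local sanity check outside NiFi, the eight patterns can be run against the sample line with Python's re module. This is only a sketch: NiFi evaluates Java regular expressions, and Python's engine is not identical (for example, Python requires fixed-width lookbehinds, which these patterns happen to satisfy). The extract helper is a made-up name, and taking capture group 1 of the first match is an assumption about how the values are picked.

```python
import re

line = ("6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd "
        "11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR "
        "(2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)")

# Patterns copied from the list above, keyed by property name.
patterns = {
    "group 1": r"(^\S+\s\S+\s\S+)",
    "group 2": r"(?i)(?<=\s)([a-f0-9]{4})(?=\s)",
    "group 3": r"(?i)(?<=\s)(PACKET\s[a-f0-9]{4,16})(?=\s)",
    "group 4": r"(?i)(?<=\s\S{16}\s)([\w]{3,})(?=\s)",
    "group 5": r"(?i)(?<=\s.{3}\s)([\w]{3,})(?=\s)",
    "group 6": r"(?i)(?<=\s.{3}\s)([\d\.]{7,15})(?=\s)",
    "group 7": r"(?i)(?<=\d\s)([a-f0-9]{4})(?=\s)",
    "group 8": r"(?i)(?<=\d\s[a-f0-9]{4}\s)(.*)$",
}


def extract(patterns, text):
    """Hypothetical helper: first match of each pattern, capture group 1."""
    result = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


for name, value in extract(patterns, line).items():
    print(name, "-", value)
```

Note how much each pattern leans on its lookbehind: groups 4 to 8 are located by the width or shape of the token in front of them, so a line with different field widths may match somewhere else or not at all.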

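It is important to note that Include Capture Group 0 is set to false. You will still get duplicate groups (group 1 and group 1.1) because of the way regular expressions are validated in NiFi (currently every regular expression must contain at least one capture group; this will be fixed by NIFI-4095, "ExtractText should not require a capture group in every regular expression").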
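
The duplicate names can be pictured with a small Python model. This is not NiFi code: the function name is invented, and the naming rule (each capture group i stored as "<name>.<i>", with the first stored group also copied to the bare "<name>") is inferred from the attribute names in the log output of this answer. The include_group_0 branch in particular is an assumption about what the setting changes.

```python
import re


def extract_text_attributes(name, pattern, text, include_group_0=False):
    """Hypothetical model of the attribute names seen in the log output.

    Each capture group i is stored as "<name>.<i>"; the first stored
    group is also stored under the bare "<name>".
    """
    match = re.search(pattern, text)
    if match is None:
        return {}
    start = 0 if include_group_0 else 1  # assumed effect of the setting
    attributes = {}
    for i in range(start, match.re.groups + 1):
        attributes["%s.%d" % (name, i)] = match.group(i)
        if i == start:
            attributes[name] = match.group(i)
    return attributes


attrs = extract_text_attributes(
    "group 2", r"(?i)(?<=\s)([a-f0-9]{4})(?=\s)", "PM 0FA0 PACKET")
print(attrs)  # {'group 2.1': '0FA0', 'group 2': '0FA0'}
```

Under this model, a pattern with exactly one capture group always yields two attributes holding the same value, which matches the paired keys in the log.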

The resulting flowfile has the attributes properly populated:

Full log output:

2017-06-20 14:45:57,050 INFO [Timer-Driven Process Thread-9] o.a.n.processors.standard.LogAttribute LogAttribute[id=c6b04310-015c-1000-b21e-c64aec5b035e] logging for flow file StandardFlowFileRecord[uuid=5209cc65-08fe-44a4-be96-9f9f58ed2490,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1497984255809-1, container=default, section=1], offset=444, length=148],offset=0,name=1920315756631364,size=148]
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
    Value: 'Tue Jun 20 14:45:10 EDT 2017'
Key: 'lineageStartDate'
    Value: 'Tue Jun 20 14:45:10 EDT 2017'
Key: 'fileSize'
    Value: '148'
FlowFile Attribute Map Content
Key: 'filename'
    Value: '1920315756631364'
Key: 'group 1'
    Value: '6/19/2017 12:14:07 PM'
Key: 'group 1.1'
    Value: '6/19/2017 12:14:07 PM'
Key: 'group 2'
    Value: '0FA0'
Key: 'group 2.1'
    Value: '0FA0'
Key: 'group 3'
    Value: 'PACKET 0000000DF5EC3D80'
Key: 'group 3.1'
    Value: 'PACKET 0000000DF5EC3D80'
Key: 'group 4'
    Value: 'UDP'
Key: 'group 4.1'
    Value: 'UDP'
Key: 'group 5'
    Value: 'Snd'
Key: 'group 5.1'
    Value: 'Snd'
Key: 'group 6'
    Value: '11.222.333.44'
Key: 'group 6.1'
    Value: '11.222.333.44'
Key: 'group 7'
    Value: '93c8'
Key: 'group 7.1'
    Value: '93c8'
Key: 'group 8'
    Value: 'R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)'
Key: 'group 8.1'
    Value: 'R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)'
Key: 'path'
    Value: './'
Key: 'uuid'
    Value: '5209cc65-08fe-44a4-be96-9f9f58ed2490'
--------------------------------------------------
6/19/2017 12:14:07 PM 0FA0 PACKET 0000000DF5EC3D80 UDP Snd 11.222.333.44 93c8 R Q [8085 A DR NOERROR] PTR (2)73(3)191(3)250(2)10(7)in-addr(4)arpa(0)

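Another option, as of the NiFi 1.3.0 release, is to use the record processing capabilities. This is a new feature that allows arbitrary input formats (Avro, JSON, CSV, etc.) to be parsed and manipulated in a streaming manner. Mark Payne has written a very good tutorial that introduces the feature and provides some simple walkthroughs.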

Splitting a string on spaces using a regular expression
