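Regex to remove blank lines

Date: 2018-09-19 19:39:52

Tags: regex unix apache-nifi

I am trying to remove both blank lines and invalid records with a single regular expression, but it does not seem to work. In the example below, the records with ServerSerial "0" and an empty ServerName ("") are the invalid ones: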

{"eventType":"delete","ServerSerial":"1142691750","ServerName":"XYZ_P_O","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"0","ServerName":"","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691950","ServerName":"ABC_P_1","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"0","ServerName":"","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"0","ServerName":"","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691750","ServerName":"COL_P_1","deletedat":"2018-08-24 15:30:48.136"}

The following regex removes only the invalid entries; it does not remove the blank lines they leave behind:

.*(?<=ServerSerial":")0(?=").*|.*(?<=ServerName":")(?=").*

I also tried this, with no luck:

.*(?<=ServerSerial":")0(?=").*[\r\n]*|.*(?<=ServerName":")(?=").*[\r\n]*

The current output still contains blank lines:

{"eventType":"delete","ServerSerial":"1142691750","ServerName":"XYZ_P_O","deletedat":"2018-08-24 15:30:48.136"},

{"eventType":"delete","ServerSerial":"1142691950","ServerName":"ABC_P_1","deletedat":"2018-08-24 15:30:48.136"},


{"eventType":"delete","ServerSerial":"1142691750","ServerName":"COL_P_1","deletedat":"2018-08-24 15:30:48.136"}

But the expected output is:

{"eventType":"delete","ServerSerial":"1142691750","ServerName":"XYZ_P_O","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691950","ServerName":"ABC_P_1","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691750","ServerName":"COL_P_1","deletedat":"2018-08-24 15:30:48.136"}

6 answers:

Answer 0 (score: 0)

  

Method 1: Using one ReplaceText processor:

I used the regex that you mentioned in the question.

Configure the ReplaceText processor as follows:

Search Value

(?<=ServerSerial":")0(?=").*[\r\n]*|.*(?<=ServerName":")(?=").*[\r\n]

Replacement Value

${literal("")} // there are no capture groups, so an empty value is used as the replacement

Input:

{"eventType":"delete","ServerSerial":"1142691750","ServerName":"XYZ_P_O","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"0","ServerName":"","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691950","ServerName":"ABC_P_1","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"0","ServerName":"","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"0","ServerName":"","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691750","ServerName":"COL_P_1","deletedat":"2018-08-24 15:30:48.136"}

Output:

{"eventType":"delete","ServerSerial":"1142691750","ServerName":"XYZ_P_O","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691950","ServerName":"ABC_P_1","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691750","ServerName":"COL_P_1","deletedat":"2018-08-24 15:30:48.136"}
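To check the single-pass idea outside NiFi, here is a minimal Python sketch that applies the second pattern from the question to the whole text at once. The `record` helper and the variable names are hypothetical, and the sketch assumes Python's `re` module treats these lookarounds the same way NiFi's Java regex engine does.

```python
import re

def record(serial, name):
    # Build one line shaped like the sample records in the question.
    return ('{"eventType":"delete","ServerSerial":"%s","ServerName":"%s",'
            '"deletedat":"2018-08-24 15:30:48.136"}' % (serial, name))

rows = [
    record("1142691750", "XYZ_P_O"),
    record("0", ""),
    record("1142691950", "ABC_P_1"),
    record("0", ""),
    record("0", ""),
    record("1142691750", "COL_P_1"),
]
text = ",\n".join(rows)

# The second pattern from the question: each alternative consumes the whole
# invalid line plus any line breaks that follow it.
INVALID_LINE = re.compile(
    r'.*(?<=ServerSerial":")0(?=").*[\r\n]*'
    r'|.*(?<=ServerName":")(?=").*[\r\n]*'
)

cleaned = INVALID_LINE.sub("", text)
print(cleaned)
```

Because the whole text is processed in one pass here, the trailing `[\r\n]*` swallows the line break of each removed record. Whether ReplaceText behaves the same way depends on its Evaluation Mode, which this sketch does not model.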
  

Method 2: Using the QueryRecord processor:

If you know the schema of your data, you can use the QueryRecord processor and add a new property to it with this value:

select * from FLOWFILE where ServerName is not null and ServerSerial > 0

The processor then outputs a flowfile containing only the records that satisfy the SQL query above.
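The effect of the query `select * from FLOWFILE where ServerName is not null and ServerSerial > 0` can be tried locally with a rough sketch that uses Python's built-in `sqlite3` as a stand-in for QueryRecord's SQL engine. The table schema, with a numeric ServerSerial column, is an assumption made for illustration, and SQLite's type handling may differ from what NiFi does with a real record schema.

```python
import sqlite3

# Sample records from the question as (eventType, ServerSerial, ServerName, deletedat).
rows = [
    ("delete", 1142691750, "XYZ_P_O", "2018-08-24 15:30:48.136"),
    ("delete", 0, "", "2018-08-24 15:30:48.136"),
    ("delete", 1142691950, "ABC_P_1", "2018-08-24 15:30:48.136"),
    ("delete", 0, "", "2018-08-24 15:30:48.136"),
    ("delete", 0, "", "2018-08-24 15:30:48.136"),
    ("delete", 1142691750, "COL_P_1", "2018-08-24 15:30:48.136"),
]

conn = sqlite3.connect(":memory:")
# Assumed schema: ServerSerial is numeric, every other field is text.
conn.execute(
    "CREATE TABLE FLOWFILE "
    "(eventType TEXT, ServerSerial INTEGER, ServerName TEXT, deletedat TEXT)"
)
conn.executemany("INSERT INTO FLOWFILE VALUES (?, ?, ?, ?)", rows)

# The same query text that goes into the QueryRecord property.
kept = conn.execute(
    "select * from FLOWFILE where ServerName is not null and ServerSerial > 0"
).fetchall()
names = [row[2] for row in kept]
print(names)
conn.close()
```

In this sketch an empty string is not NULL, so it is the `ServerSerial > 0` condition that actually drops the invalid sample rows.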

  

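Method 3: Using two ReplaceText processors in series:

The first ReplaceText processor removes the invalid records with the regex from the question. Configure the second ReplaceText processor as follows to remove the blank lines that are left behind:

Search Value

\n+\s+

Replacement Value

shift+enter

Character Set

UTF-8

Maximum Buffer Size

1 MB // change this value to suit your flowfile size

Replacement Strategy

Regex Replace

Evaluation Mode

Entire text

I tried this on my local instance with the following data.

Input flowfile content:

{"eventType":"delete","ServerSerial":"1142691750","ServerName":"XYZ_P_O","deletedat":"2018-08-24 15:30:48.136"},

{"eventType":"delete","ServerSerial":"1142691950","ServerName":"ABC_P_1","deletedat":"2018-08-24 15:30:48.136"},


{"eventType":"delete","ServerSerial":"1142691750","ServerName":"COL_P_1","deletedat":"2018-08-24 15:30:48.136"}

Output flowfile content:

{"eventType":"delete","ServerSerial":"1142691750","ServerName":"XYZ_P_O","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691950","ServerName":"ABC_P_1","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691750","ServerName":"COL_P_1","deletedat":"2018-08-24 15:30:48.136"}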

See this link for more on replacing empty lines in a flowfile.
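The blank-line cleanup with the search pattern `\n+\s+` can be sketched in plain Python as follows. The records are abbreviated for readability, and the replacement is assumed to be a single newline, which is what typing Shift+Enter into the NiFi property editor is meant to produce.

```python
import re

# Abbreviated records; the blank lines are what the first pass left behind.
text = (
    '{"ServerSerial":"1142691750","ServerName":"XYZ_P_O"},\n'
    '\n'
    '{"ServerSerial":"1142691950","ServerName":"ABC_P_1"},\n'
    '\n'
    '\n'
    '{"ServerSerial":"1142691750","ServerName":"COL_P_1"}'
)

# Search Value \n+\s+ with a single newline as the Replacement Value.
collapsed = re.sub(r"\n+\s+", "\n", text)
print(collapsed)
```

One thing to keep in mind: `\s+` also matches spaces and tabs, so this pattern strips any indentation at the start of the line that follows a blank line.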

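Answer 1 (score: 0)

Add this to your second regex as a leading alternative (note the trailing |):

(?<=[\r\n])[\r\n]|

It removes blank lines by deleting any line-break character that immediately follows another line-break character.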
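A small Python sketch of the lookbehind pattern on its own (without the trailing `|`) shows what it does to LF-only text and to CRLF text. The sample strings and the constant name are made up for illustration.

```python
import re

# A line break that is itself preceded by a line break.
BLANK_LINE_BREAK = re.compile(r"(?<=[\r\n])[\r\n]")

# LF-only text: every run of newlines collapses to a single newline.
lf_text = "first,\n\nsecond,\n\n\nthird"
lf_result = BLANK_LINE_BREAK.sub("", lf_text)
print(repr(lf_result))

# CRLF text: the \n of each \r\n pair is also preceded by a line break,
# so it is removed even though there is no blank line here.
crlf_text = "first,\r\nsecond"
crlf_result = BLANK_LINE_BREAK.sub("", crlf_text)
print(repr(crlf_result))
```

The second case suggests the pattern is best suited to text with Unix (LF) line endings; with Windows line endings it would leave bare carriage returns behind.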

Answer 2 (score: 0)

Once the file has been converted to a Unix file, you can use:

grep -Ev 'ServerSerial":"0?"|ServerName":"0?"' inputfile

Answer 3 (score: 0)

You can get rid of these blank lines as follows.

Use the ReplaceText processor.

Search:  \n\n\s|\n\s

Replace:  \n

http://regexr.com/3fbst

Reference: How to use regex to remove the spaces between two rows?

Let me know if you run into any problems.

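Answer 4 (score: 0)

If all your records are line-based, you can solve this with Perl. In a Perl one-liner, the hex escape \x22 can stand in for the double quote. See whether the following works for you. I have also added blank lines to your input.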

>cat regex_event.dat
{"eventType":"delete","ServerSerial":"1142691750","ServerName":"XYZ_P_O","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"0","ServerName":"","deletedat":"2018-08-24 15:30:48.136"},

{"eventType":"delete","ServerSerial":"1142691950","ServerName":"ABC_P_1","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"0","ServerName":"","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"0","ServerName":"","deletedat":"2018-08-24 15:30:48.136"},

{"eventType":"delete","ServerSerial":"1142691750","ServerName":"COL_P_1","deletedat":"2018-08-24 15:30:48.136"}
>
>perl -ne ' s/^\s*$//g; print if length($_) > 0 and not m/\x22ServerSerial\x22:\x220\x22,\x22ServerName\x22:\x22\x22/' regex_event.dat
{"eventType":"delete","ServerSerial":"1142691750","ServerName":"XYZ_P_O","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691950","ServerName":"ABC_P_1","deletedat":"2018-08-24 15:30:48.136"},
{"eventType":"delete","ServerSerial":"1142691750","ServerName":"COL_P_1","deletedat":"2018-08-24 15:30:48.136"}
>

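Answer 5 (score: 0)

As of NiFi 1.7.0 (via NIFI-4456), you can configure JsonTreeReader to read input in "one JSON per line" format. You can then use QueryRecord to issue SQL queries that route the records however you like. For example, add a property named "invalid" with the query SELECT * FROM FLOWFILE WHERE ServerSerial = 0 AND ServerName = "" and a property named "valid" with the query SELECT * FROM FLOWFILE WHERE ServerSerial <> 0 OR ServerName <> "", and so on.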
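The routing idea can be illustrated outside NiFi with a short Python sketch that reads one JSON object per line and splits the records using the same predicate as the "invalid" query. The `route` function is hypothetical and is not a NiFi API; stripping the trailing comma is an assumption based on the sample data, where every line but the last ends with one.

```python
import json

def route(flowfile_text):
    """Split one-JSON-per-line text into (valid, invalid) record lists."""
    valid, invalid = [], []
    for line in flowfile_text.splitlines():
        line = line.strip().rstrip(",")  # sample lines end with a trailing comma
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        # Same predicate as the "invalid" query: serial 0 and empty name.
        if int(rec["ServerSerial"]) == 0 and rec["ServerName"] == "":
            invalid.append(rec)
        else:
            valid.append(rec)
    return valid, invalid

sample = "\n".join([
    '{"eventType":"delete","ServerSerial":"1142691750","ServerName":"XYZ_P_O","deletedat":"2018-08-24 15:30:48.136"},',
    '{"eventType":"delete","ServerSerial":"0","ServerName":"","deletedat":"2018-08-24 15:30:48.136"},',
    '',
    '{"eventType":"delete","ServerSerial":"1142691750","ServerName":"COL_P_1","deletedat":"2018-08-24 15:30:48.136"}',
])

valid, invalid = route(sample)
print([rec["ServerName"] for rec in valid])
print(len(invalid))
```

Parsing each line as JSON avoids depending on the exact field order or spacing, which the regex-based approaches do rely on.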