Extracting a CSV column's value to add as an attribute

Date: 2019-06-04 17:49:49

Tags: regex csv apache-nifi regex-lookarounds regex-group

I'm processing some CSVs in NiFi, and my pipeline is generating some duplicates. As a result, I'd like to use the DetectDuplicate processor, but to do that I need an attribute I can compare on to detect the duplicates. I have an ExtractText processor, and I'd like to use a regex to grab the value of the SHA1_BASE16 column.

I tried the regex string below (suggested by a friend; I don't fully understand it) against the CSV below, but it highlights the wrong field plus some extra content. How can I get the regex to capture only the SHA1_BASE16 value?

RegEx

^[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,([^,]*),[^,]*,[^,]*,[^,]*,[^,]*,[^,]*,([^,]*),[^,]*,[^,]*,([^,]*)\S*

CSV

"USER_JID","CREATED_AT","UPLOAD_TIME","SHA1_BASE32","SHA1_BASE16","HASH_SOURCE","MESSAGE_TYPE","IPV4"
"dreynolds","1932/04/01 20:23:35 UTC","2016/12/28 20:23:11 UTC","72F20077A79A0D4D90F4C0669FB6EA4BC5953293","FB1D928B83DEBCD2B2E53DF4C8D4C2AF99EB81E0","HOLLYWOOD","TWITTER","123.123.123.123"

Actual output

Match 1
Full match  0-291   "USER_JID","CREATED_AT","UPLOAD_TIME","SHA1_BASE32","SHA1_BASE16","HASH_SOURCE","MESSAGE_TYPE","IPV4...
Group 1.    66-79   "HASH_SOURCE"
Group 2.    209-251 "FB1D928B83DEBCD2B2E53DF4C8D4C2AF99EB81E0"
Group 3.    274-291 "123.123.123.123"

Expected output

Match 1
Full match  0-291   "USER_JID","CREATED_AT","UPLOAD_TIME","SHA1_BASE32","SHA1_BASE16","HASH_SOURCE","MESSAGE_TYPE","IPV4...
Group 1.    209-251 "FB1D928B83DEBCD2B2E53DF4C8D4C2AF99EB81E0"

2 answers:

Answer 0 (score: 1)

My guess is that we have two 40-character strings here, so we can use the first one as a left boundary and apply this simple expression:

.+"[A-Z0-9]{40}",("[A-Z0-9]{40}").+

Our desired output is in this capturing group:

("[A-Z0-9]{40}")

which we can reference with $1.

Demo

Test

const regex = /.+"[A-Z0-9]{40}",("[A-Z0-9]{40}").+/gm;
const str = `"dreynolds","1932/04/01 20:23:35 UTC","2016/12/28 20:23:11 UTC","72F20077A79A0D4D90F4C0669FB6EA4BC5953293","FB1D928B83DEBCD2B2E53DF4C8D4C2AF99EB81E0","HOLLYWOOD","TWITTER","123.123.123.123"`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
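To apply this in NiFi's ExtractText, a minimal sketch would be a single dynamic property holding the expression (the property name sha1_base16 is my own placeholder, not something from the question):

sha1_base16 = .+"[A-Z0-9]{40}",("[A-Z0-9]{40}").+

Assuming ExtractText's usual convention of exposing capture groups as attributes named <property>.<group number>, the hash (with its surrounding quotes, since the group includes them) should land in the attribute sha1_base16.1, which DetectDuplicate's Cache Entry Identifier can then reference as ${sha1_base16.1}.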

RegEx Circuit

The expression can also be visualized with jex.im.

Answer 1 (score: 0)

Alternatively, you can use PartitionRecord to split the records into flow files where every record in a given flow file has the same value for the partition field (SHA1_BASE16 in this case). It will also set an attribute on each flow file with the partition value, which you can then use in DetectDuplicate, as sketched below.
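A rough sketch of that configuration (the dynamic property name is arbitrary; its value is a RecordPath pointing at the field):

SHA1_BASE16 = /SHA1_BASE16

Each outgoing flow file then carries a SHA1_BASE16 attribute holding that partition's value, so DetectDuplicate can reference it as ${SHA1_BASE16}.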

For a high-cardinality field (one without many duplicates) you may see a performance hit, since there could be as little as one row per outgoing flow file, so for a large number of rows you'll get a large number of flow files. That said, instead of DetectDuplicate downstream, you could instead use RouteOnAttribute to route where record.count > 1 (see the sketch below). That removes the need for a DistributedMapCache.
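A minimal sketch of that route, using NiFi Expression Language (the property/relationship name duplicates is my own placeholder):

duplicates = ${record.count:gt(1)}

Flow files containing more than one record, i.e. the partitions where PartitionRecord grouped duplicates together, go to the duplicates relationship; single-record flow files follow unmatched.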

There is also a contribution to add a DetectDuplicateRecord processor, which I think is what you're really after. That contribution is under review, and I hope it will make it into the next release of NiFi.