Using RegEx on another field to create a new field in an R data.table

Time: 2018-06-08 21:16:53

Tags: r regex data.table

Given this data.table:

library(data.table)

dt <- data.table(f1 =  c(
  "stuffstuff-0000097125",
  "stuffstuff.abc.0006496679",
  "stuffstuff0007517235",
  "stuffstuff_xyz.0007280719",
  "stuffstuff0005995303",
  "stuffstuff_a1b_0000143856",
  "stuffstuff0009362407",
  "stuffstuff.c44_0009735298"
))

I would like to get these results:

                          f1 parsed_val
1:     stuffstuff-0000097125        
2: stuffstuff.abc.0006496679        abc
3:      stuffstuff0007517235        
4: stuffstuff_xyz.0007280719        xyz
5:      stuffstuff0005995303        
6: stuffstuff_a1b_0000143856        a1b
7:      stuffstuff0009362407        
8: stuffstuff.c44_0009735298        c44

Here is my attempt:

rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"

dt[, `:=`(parsed_val = regmatches(f1, regexpr(pattern = rex_pattern, f1, perl = TRUE)))]  

However, because of recycling, these are the results I get:

                          f1 parsed_val
1:     stuffstuff-0000097125        abc
2: stuffstuff.abc.0006496679        xyz
3:      stuffstuff0007517235        a1b
4: stuffstuff_xyz.0007280719        c44
5:      stuffstuff0005995303        abc
6: stuffstuff_a1b_0000143856        xyz
7:      stuffstuff0009362407        a1b
8: stuffstuff.c44_0009735298        c44
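
The shifted values can be reproduced outside data.table. The following is a minimal sketch, assuming a plain character vector f1 that holds the same values as the column (the names m, matches, and recycled are only illustrative). regmatches returns one element per successful match, so only 4 values come back for 8 rows, and repeating them to length 8 gives the column shown above. As far as I know, newer data.table releases refuse to recycle a length-4 vector in := and raise an error instead, so the exact symptom may depend on the installed version.

```r
# Plain character vector with the same values as dt$f1 (illustrative copy).
f1 <- c(
  "stuffstuff-0000097125",
  "stuffstuff.abc.0006496679",
  "stuffstuff0007517235",
  "stuffstuff_xyz.0007280719",
  "stuffstuff0005995303",
  "stuffstuff_a1b_0000143856",
  "stuffstuff0009362407",
  "stuffstuff.c44_0009735298"
)
rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"

# regexpr reports -1 for every element without a match.
m <- regexpr(rex_pattern, f1, perl = TRUE)

# regmatches keeps only the elements that matched, so the result is shorter.
matches <- regmatches(f1, m)
length(f1)       # 8
length(matches)  # 4

# Repeating the 4 values to length 8 reproduces the misaligned column.
recycled <- rep_len(matches, length(f1))
recycled
```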

I tried using ifelse inside a function to return a fallback string when nothing is found:

getMmFromFilename <- function(my_file_name){
  rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"
  nothing_found <- character(length = 0)

  mm <- regmatches(my_file_name, regexpr(pattern = rex_pattern, my_file_name, perl = TRUE))
  ifelse(identical(mm, nothing_found), "missing_Mm", mm)
}

dt[, .(parsed_val = getMmFromFilename(f1))]

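But this returns only one value, abc. The regmatches documentation says: "For vector match data (as obtained from regexpr), empty matches are dropped; for list match data, empty matches give empty components (zero-length character vectors)." I guess the solution lies somewhere in there, but I haven't worked it out yet...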
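
The single value comes from ifelse itself: identical() returns one logical, and ifelse returns a result the same length as its test, so only the first element of mm survives. One way to keep every row aligned is to pre-fill the result with empty strings and write the matches back only at the positions where regexpr reported a hit. This is a sketch under assumptions, not a definitive fix: extract_or_empty is a hypothetical helper name, and f1 is a plain-vector copy of the column.

```r
# ifelse follows the length of its test, which is 1 here.
ifelse(FALSE, "missing_Mm", c("abc", "xyz"))  # only the first element is returned

# Hypothetical helper: one output element per input element.
extract_or_empty <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m > -1L          # TRUE where the pattern matched
  out <- character(length(x))         # starts as all ""
  out[hit] <- regmatches(x, m)        # matches go back to their own rows
  out
}

f1 <- c(
  "stuffstuff-0000097125",
  "stuffstuff.abc.0006496679",
  "stuffstuff0007517235",
  "stuffstuff_xyz.0007280719",
  "stuffstuff0005995303",
  "stuffstuff_a1b_0000143856",
  "stuffstuff0009362407",
  "stuffstuff.c44_0009735298"
)
rex_pattern <- "(?<=(\\.|\\_|\\-))[A-Za-z0-9]{3}(?=(\\.|\\_|\\-)[0-9]{3,})"

parsed <- extract_or_empty(f1, rex_pattern)
parsed
```

Because the helper returns a vector as long as its input, it should be usable directly in j, for example dt[, parsed_val := extract_or_empty(f1, rex_pattern)].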

As for the solution, my workflow requires me to use data.table, and a brief explanation of the solution would be a huge help...

Thanks in advance.

1 answer:

Answer 0 (score: 1)

dt[,parser_val:=sub(".*?[._](.*)[._].*|.*","\\1",f1)]
dt
                          f1 parser_val
1:     stuffstuff-0000097125           
2: stuffstuff.abc.0006496679        abc
3:      stuffstuff0007517235           
4: stuffstuff_xyz.0007280719        xyz
5:      stuffstuff0005995303           
6: stuffstuff_a1b_0000143856        a1b
7:      stuffstuff0009362407           
8: stuffstuff.c44_0009735298        c44
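
Reading the pattern piece by piece may help. The sketch below applies the same sub call to a plain character vector (f1 and parsed are illustrative names, not part of the answer) and annotates each part of the pattern in comments.

```r
f1 <- c(
  "stuffstuff-0000097125",
  "stuffstuff.abc.0006496679",
  "stuffstuff0007517235",
  "stuffstuff_xyz.0007280719",
  "stuffstuff0005995303",
  "stuffstuff_a1b_0000143856",
  "stuffstuff0009362407",
  "stuffstuff.c44_0009735298"
)

# .*?    lazily skip everything before the first "." or "_"
# [._]   the opening separator
# (.*)   group 1: the text between the separators
# [._]   the closing separator
# .*     the rest of the string
# |.*    fallback branch: no pair of separators, so match the whole
#        string and leave group 1 empty
parsed <- sub(".*?[._](.*)[._].*|.*", "\\1", f1)
parsed
```

Since sub always returns one element per input, the result lines up with the rows without any recycling. Two differences from the pattern in the question are worth keeping in mind: "-" is not in the separator class, and group 1 is not limited to three alphanumeric characters, so messier file names may need a tighter group.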

If you want to use regmatches, you can use pattern="(?<=[._]).*(?=[._])|$" with perl=TRUE:

dt[, parser_val := regmatches(f1, regexpr("(?<=[._]).*(?=[._])|$", f1, perl = TRUE))]
dt
                          f1 parser_val
1:     stuffstuff-0000097125           
2: stuffstuff.abc.0006496679        abc
3:      stuffstuff0007517235           
4: stuffstuff_xyz.0007280719        xyz
5:      stuffstuff0005995303           
6: stuffstuff_a1b_0000143856        a1b
7:      stuffstuff0009362407           
8: stuffstuff.c44_0009735298        c44
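
The reason the |$ alternative helps can be seen by inspecting what regexpr returns. In the sketch below (f1, m, and parsed are illustrative names), rows without separators still get a match: a zero-length one at the end of the string. regexpr therefore never reports -1, and regmatches has nothing to drop, so it returns "" for those rows and the result keeps one element per row.

```r
f1 <- c(
  "stuffstuff-0000097125",
  "stuffstuff.abc.0006496679",
  "stuffstuff0007517235",
  "stuffstuff_xyz.0007280719",
  "stuffstuff0005995303",
  "stuffstuff_a1b_0000143856",
  "stuffstuff0009362407",
  "stuffstuff.c44_0009735298"
)

m <- regexpr("(?<=[._]).*(?=[._])|$", f1, perl = TRUE)

# Every element has a match position; none is -1.
all(m > 0)

# Length 0 where only the `$` branch matched, 3 where a code was found.
attr(m, "match.length")

# One element per input, with "" for the zero-length matches.
parsed <- regmatches(f1, m)
parsed
```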