Using regex in R to extract a variable instance of a pattern from strings

Asked: 2018-12-05 06:29:31

Tags: r regex variables

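I have a column of strings (in a data.table) that I need to parse based on a pattern (the text between "-" characters) and a defined, but variable, instance number of that pattern. I am not sure how to do this with regex: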

> test <- c("AAA-bb-ccc", "abcd-efgh","blah", "blah-blah-blah-blah")

Suppose the predefined instance number is i.

> i = 1
> output
"AAA"  "abcd"  "blah"  "blah"

> i = 2
> output
"bb"  "efgh"  ""  "blah"


> i = 3
> output
"ccc"  ""  ""  "blah"

How would I achieve this with a general regular expression?

3 answers:

Answer 0 (score: 1)

We can create a function that splits on "-" and returns the i-th value.
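
A minimal sketch of such a function, assuming base R only. The name `get_ith` and its `sep` argument are hypothetical and are not taken from the original answer; strings with fewer than i pieces yield "", as in the question's expected output.

```r
# Hypothetical helper: return the i-th "-"-separated piece of each string,
# or "" when a string has fewer than i pieces.
get_ith <- function(x, i, sep = "-") {
  # fixed = TRUE treats sep as a literal string, not a regex
  pieces <- strsplit(x, sep, fixed = TRUE)
  vapply(pieces, function(p) if (length(p) >= i) p[i] else "", character(1))
}

test <- c("AAA-bb-ccc", "abcd-efgh", "blah", "blah-blah-blah-blah")
get_ith(test, 1)
# [1] "AAA"  "abcd" "blah" "blah"
get_ith(test, 2)
# [1] "bb"   "efgh" ""     "blah"
get_ith(test, 3)
# [1] "ccc"  ""     ""     "blah"
```

Using `vapply` with `character(1)` guarantees one string per input element, so the result can be assigned directly to a new column.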

Answer 1 (score: 1)

For i = 3, you could try

unlist(lapply(strsplit(test,split = '-'),'[',3)) 
[1] "ccc"  NA     NA     "blah"
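
The call above returns NA where a string has fewer than three pieces, whereas the question expects empty strings. A small sketch of the same approach with i held in a variable and the NA values replaced afterwards (the variable name `out` is only illustrative):

```r
test <- c("AAA-bb-ccc", "abcd-efgh", "blah", "blah-blah-blah-blah")
i <- 3

# Same idea as the one-liner above: split on "-" and index each piece vector
out <- unlist(lapply(strsplit(test, split = "-"), `[`, i))

# Indexing past the end of a character vector gives NA; map those to ""
out[is.na(out)] <- ""
out
# [1] "ccc"  ""     ""     "blah"
```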

Answer 2 (score: 1)

We can also use tokenize_regex from the tokenizers package, then data.table::transpose the result and cbind the relevant columns into a data.table

test <- c("AAA-bb-ccc", "abcd-efgh","blah", "blah-blah-blah-blah")

library(tokenizers)
library(data.table)
test <- transpose(tokenize_regex(test, "-"), fill = "")

i <- 1:3
as.data.table(do.call(cbind, test[i]))
#     V1   V2   V3
#1:  AAA   bb  ccc
#2: abcd efgh
#3: blah
#4: blah blah blah
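
Since the question asks specifically for a regular expression, here is a rough base R sketch that builds the pattern from i instead of splitting. It is not taken from any of the answers; the function name `extract_ith` is hypothetical, and it assumes an R version in which `regexec` accepts `perl = TRUE` (3.3.0 or later). The pattern skips i - 1 pieces, each followed by "-", then captures the next run of non-hyphen characters.

```r
# Hypothetical helper: regex-only extraction of the i-th "-"-separated piece.
extract_ith <- function(x, i) {
  # e.g. for i = 3: ^(?:[^-]*-){2}([^-]*)
  pattern <- sprintf("^(?:[^-]*-){%d}([^-]*)", as.integer(i - 1))
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Each match is c(full match, captured group); no match gives character(0)
  vapply(m, function(g) if (length(g) == 2) g[2] else "", character(1))
}

test <- c("AAA-bb-ccc", "abcd-efgh", "blah", "blah-blah-blah-blah")
extract_ith(test, 2)
```

Strings with fewer than i pieces do not match the pattern at all, so they fall through to "" rather than NA.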