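I want to extract the values between '[' and ',' and put the extracted values in a new column (col2).
I'm not opposed to using stringr instead of base R.
Sample data: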
df <- structure(list(t = structure(1:2, .Label = c("v1", "v2"), class = "factor"),
d = structure(1:2, .Label = c("something[123,894]", "something[456,4834]"
), class = "factor")), .Names = c("t", "d"), row.names = c(NA,
-2L), class = "data.frame")
which looks like:
t d
1 v1 something[123,894]
2 v2 something[456,4834]
Now I want to create a new column (df$r) and extract the value 123 for v1 and 456 for v2.
I'm sure there is a simple way to do this with a regular expression that searches for '[' and ',', but I'm not very good with regex.
Thanks for your help.
-cherrytree
Answer 0 (score: 4)
df <- structure(list(t = structure(1:2, .Label = c("v1", "v2"), class = "factor"),
d = structure(1:2, .Label = c("something[123,894]", "something[456,4834]"
), class = "factor")), .Names = c("t", "d"), row.names = c(NA,
-2L), class = "data.frame")
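This matches any characters (.*) up to and including '[', then captures one or more digits (\\d+) into group \\1, closes the capture group, and finally matches any remaining characters. Because the pattern consumes the whole string, replacing it with \\1 leaves only the captured digits: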
df$r <- gsub('.*\\[(\\d+).*', '\\1', df$d)
# t d r
# 1 v1 something[123,894] 123
# 2 v2 something[456,4834] 456
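One caveat with the replacement approach above: when the pattern does not match, sub() and gsub() return the input unchanged rather than NA. The sketch below shows that behaviour and one possible alternative using regexpr() and regmatches() with a Perl-style lookbehind. The vector d, including its third non-matching value, is made up for illustration and is not part of the question's data.

```r
# Values from the question's df$d plus one hypothetical non-matching value
d <- c("something[123,894]", "something[456,4834]", "no brackets here")

# sub() leaves the input unchanged when the pattern does not match
r_sub <- sub(".*\\[(\\d+).*", "\\1", d)

# Alternative: extract only what actually matched and keep NA otherwise.
# (?<=\\[) is a lookbehind, so perl = TRUE is required.
m <- regexpr("(?<=\\[)\\d+", d, perl = TRUE)
r <- rep(NA_character_, length(d))
r[m != -1] <- regmatches(d, m)

# Convert to integer if numeric values are wanted
r_num <- as.integer(r)
```

Here r_sub still contains "no brackets here" for the third element, while r_num holds NA in that position, which is usually easier to spot downstream.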
Also, if you want to capture the second run of digits after the comma, this is more useful:
gsub('.*\\[(\\d+),(\\d+).*', '\\1', df$d)
# [1] "123" "456"
gsub('.*\\[(\\d+),(\\d+).*', '\\2', df$d)
# [1] "894" "4834"
Or, if you want to do it all in one go:
cbind(df, do.call('rbind', lapply(strsplit(as.character(df$d), ','),
function(x) gsub('\\D', '', x))))
# t d 1 2
# 1 v1 something[123,894] 123 894
# 2 v2 something[456,4834] 456 4834
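The split-and-strip approach above removes every non-digit, so it would also pick up any digits that happen to appear in the prefix before the bracket. A sketch of an alternative that ties both numbers to the brackets, using base R's regexec() and regmatches(), might look like this. The column names first and second are hypothetical, and the code assumes every row matches the pattern.

```r
# Sample values copied from the question's df$d (stored there as a factor)
d <- c("something[123,894]", "something[456,4834]")

# regexec() records the full match plus each capture group
m <- regmatches(d, regexec("\\[(\\d+),(\\d+)\\]", d))

# Each list element is c(full match, group 1, group 2); drop the full match
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("first", "second")  # hypothetical column names

# Assumes every element of d matched; a non-matching row would break rbind
res <- data.frame(d = d, parts, stringsAsFactors = FALSE)
```

With a factor column such as df$d, wrap it in as.character() first, as the answer's strsplit() call does.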
This explains it better than I can:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  \[                       '['
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  ,                        ','
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
  )                        end of \2
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
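The "matching the most amount possible" note matters when a string contains more than one '['. The following sketch uses a made-up input (not from the question's data) to show how the leading greedy .* lands on the last bracket, while a lazy .*? lands on the first. Both calls use perl = TRUE so that the quantifier semantics are those of PCRE.

```r
# Hypothetical input with two bracketed pairs
x <- "a[1,2]b[3,4]"

# Greedy .* grabs as much as possible, so the match uses the LAST '['
greedy <- sub(".*\\[(\\d+).*", "\\1", x, perl = TRUE)

# Lazy .*? grabs as little as possible, so the match uses the FIRST '['
lazy <- sub(".*?\\[(\\d+).*", "\\1", x, perl = TRUE)
```

For the question's data each value has a single '[', so both forms give the same result there.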