I have the following data frame in R:
text <- c("[AAA]xxxx", "[AAA] yyyrrr", "[AAA][bbb] bla", "[AAA][bbb] cccvvv",
"[AAA][bbb] bla", "[AAA][bbb][CcC] bla", "[AAA][bbb][CcC] xbbpr")
value <- rnorm(7)
df <- data.frame(text, value)
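I would like to create three new variables in my data frame holding the text contained in the first, second and third pairs of brackets.
The desired output looks like this: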
  text                  value       Bracket1 Bracket2 Bracket3
1 [AAA]xxxx             -0.01819034 AAA      NA       NA
2 [AAA] yyyrrr          -0.24808460 AAA      NA       NA
3 [AAA][bbb] bla        -0.36293689 AAA      bbb      NA
4 [AAA][bbb] cccvvv      1.27757055 AAA      bbb      NA
5 [AAA][bbb] bla        -0.46889715 AAA      bbb      NA
6 [AAA][bbb][CcC] bla    0.07105410 AAA      bbb      CcC
7 [AAA][bbb][CcC] xbbpr -0.26603845 AAA      bbb      CcC
I cannot even extract the text from the first pair of brackets, let alone the second or third.
For example, I tried:
df$Bracket1 <- gsub('.*\\[(.*)\\].*', '\\1', text)
and
df$Bracket1 <- sub('.*\\[(.*)\\].*', '\\1', text)
but both of these produce:
  text                  value       Bracket1
1 [AAA]xxxx             -0.01819034 AAA
2 [AAA] yyyrrr          -0.24808460 AAA
3 [AAA][bbb] bla        -0.36293689 bbb
4 [AAA][bbb] cccvvv      1.27757055 bbb
5 [AAA][bbb] bla        -0.46889715 bbb
6 [AAA][bbb][CcC] bla    0.07105410 CcC
7 [AAA][bbb][CcC] xbbpr -0.26603845 CcC
I am completely new to regular expressions and relatively new to R, so thanks in advance for any suggestions.
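A note on why both attempts return the last pair of brackets: the leading `.*` is greedy, so it consumes as much of the string as it can while still leaving one `[` for the rest of the pattern, and the capture group therefore lands on the final pair. The sketch below reproduces that on a three-element subset of the vector and contrasts it with one possible fix for the first pair only, anchoring at the start and capturing with a negated character class. The subset and the names `greedy` and `first_only` are illustrative assumptions, and the fix assumes every string starts with `[`.

```r
# Illustrative subset of the question's vector
text <- c("[AAA]xxxx", "[AAA][bbb] bla", "[AAA][bbb][CcC] bla")

# Greedy leading .* skips ahead to the LAST "[" in each string
greedy <- sub('.*\\[(.*)\\].*', '\\1', text)
greedy
# "AAA" "bbb" "CcC"

# Anchor at the start and capture everything up to the first "]"
# ([^]]* means "any run of characters that are not a closing bracket")
first_only <- sub('^\\[([^]]*)\\].*', '\\1', text)
first_only
# "AAA" "AAA" "AAA"
```

This only covers the first pair; pulling out every pair at once needs a different tool than `sub`.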
Answer 0: (score: 1)
Answer 1: (score: 1)
Here is an approach using gregexpr and regmatches:
mtchs <- regmatches(df$text, gregexpr("\\[\\w+\\]", df$text))
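As a quick check of what this step returns, the sketch below runs the same call on a three-element subset of the question's vector (the subset is an illustrative assumption). Each list element is a character vector of the bracketed tokens found in that string, brackets included, so the list is ragged and needs padding before it can become columns.

```r
# Illustrative subset of the question's vector
text <- c("[AAA]xxxx", "[AAA][bbb] bla", "[AAA][bbb][CcC] xbbpr")

# gregexpr finds every match position; regmatches pulls out the matched text
mtchs <- regmatches(text, gregexpr("\\[\\w+\\]", text))

str(mtchs)      # a list with one character vector per input string
lengths(mtchs)  # number of bracket pairs found in each string
# 1 2 3
```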
Then just reorganize the output into the desired structure:
library(plyr) # for rbind.fill
df[,3:5] <- do.call(rbind.fill,
lapply(mtchs, function(xx) {x <- data.frame(matrix(xx, nrow=1))
names(x) <- paste0("Bracket", 1:length(xx))
x}))
# or using dplyr's bind_rows:
library(dplyr)
df[,3:5] <- bind_rows(lapply(mtchs, function(xx) {x <- data.frame(matrix(xx, nrow=1))
names(x) <- paste0("Bracket", 1:length(xx))
x}))
# or using data.table's rbindlist:
library(data.table)
df[,3:5] <- rbindlist(lapply(mtchs, function(xx) {x <- data.frame(matrix(xx, nrow=1))
names(x) <- paste0("Bracket", 1:length(xx))
x}), fill=TRUE)
If needed, you can change the regular expression passed to gregexpr so that the brackets are dropped from the matches:
mtchs <- regmatches(df$text, gregexpr("(?<=\\[)\\w+(?=\\])", df$text, perl=TRUE))
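For comparison, here is a minimal base-R sketch that uses the bracket-free pattern and pads the ragged matches with NA without loading any package. The padding via `x[seq_len(n)]` and the assignment by column name are assumptions about one way to do this, not part of the answer above; the names `n` and `padded` are hypothetical.

```r
text <- c("[AAA]xxxx", "[AAA] yyyrrr", "[AAA][bbb] bla", "[AAA][bbb] cccvvv",
          "[AAA][bbb] bla", "[AAA][bbb][CcC] bla", "[AAA][bbb][CcC] xbbpr")
value <- rnorm(7)
df <- data.frame(text, value, stringsAsFactors = FALSE)

# Lookbehind/lookahead keep the brackets out of the match (needs perl = TRUE)
mtchs <- regmatches(df$text, gregexpr("(?<=\\[)\\w+(?=\\])", df$text, perl = TRUE))

# Indexing past the end of a vector yields NA, which pads every row to n entries
n <- max(lengths(mtchs))
padded <- do.call(rbind, lapply(mtchs, function(x) x[seq_len(n)]))

# Assigning by name creates the new columns with the intended names
df[paste0("Bracket", seq_len(n))] <- as.data.frame(padded, stringsAsFactors = FALSE)
df
```

Assigning by name rather than by position (`df[, 3:5]`) sidesteps any question of how the new columns end up being named.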
Answer 2: (score: 1)
Using transpose() from the data.table package:
require(data.table) # v1.9.6+
dt = data.table(text, value) # text is character
vals = regmatches(dt$text, gregexpr("(?<=\\[)[[:alpha:]]+(?=])", dt$text, perl=TRUE))
dt[, paste0("Bracket", 1:3) := transpose(vals)]
#    text                  value      Bracket1 Bracket2 Bracket3
# 1: [AAA]xxxx             -0.9285790 AAA      NA       NA
# 2: [AAA] yyyrrr           0.7928830 AAA      NA       NA
# 3: [AAA][bbb] bla         0.1177066 AAA      bbb      NA
# 4: [AAA][bbb] cccvvv      1.1818542 AAA      bbb      NA
# 5: [AAA][bbb] bla        -0.4476371 AAA      bbb      NA
# 6: [AAA][bbb][CcC] bla    2.2992593 AAA      bbb      CcC
# 7: [AAA][bbb][CcC] xbbpr  2.1161453 AAA      bbb      CcC
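To see why transpose() fits here, the small sketch below applies it to a hand-written ragged list shaped like the regmatches output (the list `vals` is an illustrative assumption, and data.table must be installed). It turns a list of per-row vectors into a list of per-column vectors, filling the short rows with NA by default, which is the shape that `:=` expects for several new columns.

```r
library(data.table)

# Hand-written stand-in for the regmatches output: one vector per row
vals <- list("AAA", c("AAA", "bbb"), c("AAA", "bbb", "CcC"))

# One vector per column, short rows padded with NA (the default fill)
cols <- transpose(vals)
cols
# [[1]] "AAA" "AAA" "AAA"
# [[2]] NA    "bbb" "bbb"
# [[3]] NA    NA    "CcC"
```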