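I want to create a column from part of a string in another column.
The reference column has the following format: GB / Ling 31st Dec
In this case I want to extract the word "Ling"; the word varies in length from row to row.
My approach so far is: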
library(data.table)
d1 <- data.table(MENU_HINT =
c("GB / Ling 31st Dec", "GB / Taun 30th Dec",
"GB / Ayr 19th Dec", "GB / Ayr 9th Nov",
"GB / ChelmC 29th Sep"),
Track = c("Ling", "Taun", "Ayr", "Ayr", "ChelmC"))
#remove all the spaces
d1[, Track2 := gsub("[[:space:]]", "", MENU_HINT)]
# get the position of the first digit
d1[, x := as.numeric(regexpr("[[:digit:]]", Track2)[[1]])]
# get the position of the '/'
d1[, y := as.numeric(regexpr("/", Track2))[[1]]]
# use above to extract the Track
d1[, Track2 := substr(Track2, y + 1, x - 1)]
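Track is what I expect to get; Track2 is what the code above actually produces.
This seems long-winded, and it also doesn't seem to work, because the x and y values are identical across the whole column.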
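The identical x and y values appear to come from the `[[1]]` subscripts: `regexpr()` is vectorised and returns one position per string, but `[[1]]` keeps only the first position, which `:=` then recycles down the column. Below is a minimal sketch of the difference, using plain character vectors rather than the data.table from the question. The names `menu_hint`, `squashed`, `x_first`, `x`, `y`, and `track` are made up for illustration.

```r
# Illustrative input; a subset of the MENU_HINT values from the question
menu_hint <- c("GB / Ling 31st Dec", "GB / Ayr 9th Nov", "GB / ChelmC 29th Sep")

# Remove all whitespace, as in the original approach
squashed <- gsub("[[:space:]]", "", menu_hint)

# With [[1]] only the position for the first string survives
x_first <- as.numeric(regexpr("[[:digit:]]", squashed)[[1]])

# Without [[1]] there is one position per string
x <- as.integer(regexpr("[[:digit:]]", squashed))    # first digit
y <- as.integer(regexpr("/", squashed, fixed = TRUE)) # the slash

# substr() is vectorised over start and stop, so each row uses its own positions
track <- substr(squashed, y + 1, x - 1)
print(track)
```

Under this reading, dropping the two `[[1]]` subscripts in the question's code should be enough to make the original approach work row by row.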
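Answer 0 (score: 4)
I wouldn't use a regular expression here; it isn't efficient on large data sets. It looks like the word you are after always comes right after the second space. A very simple and efficient solution could be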
d1[, Track2 := tstrsplit(MENU_HINT, " ", fixed = TRUE)[[3]]]
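To make the splitting step concrete, here is a rough base R equivalent of what the line above does: split each string on a literal space and keep the third piece. It is only a sketch for illustration, not the answer's method; `menu_hint`, `tokens`, and `third_token` are hypothetical names.

```r
# Illustrative input; a subset of the MENU_HINT values from the question
menu_hint <- c("GB / Ling 31st Dec", "GB / Ayr 9th Nov", "GB / ChelmC 29th Sep")

# fixed = TRUE treats " " as a literal string, so no regex engine is involved
tokens <- strsplit(menu_hint, " ", fixed = TRUE)

# Each element looks like c("GB", "/", "Ling", "31st", "Dec"); take piece 3
third_token <- vapply(tokens, function(tok) tok[3], character(1))
print(third_token)
```

Both this sketch and the `tstrsplit` call assume that the fields are separated by exactly one space and that the track name itself contains no space.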
Benchmark
bigDT <- data.table(MENU_HINT = sample(d1$MENU_HINT, 1e6, replace = TRUE))
microbenchmark::microbenchmark("sub: " = sub("\\S+[[:punct:] ]+(\\S+).*", "\\1", bigDT$MENU_HINT),
"gsub: " = gsub("^[^/]+/\\s*|\\s+.*$", "", bigDT$MENU_HINT),
"tstrsplit: " = tstrsplit(bigDT$MENU_HINT, " ", fixed = TRUE)[[3]])
# Unit: milliseconds
# expr min lq mean median uq max neval
# sub: 982.1185 998.6264 1058.1576 1025.8775 1083.1613 1405.051 100
# gsub: 1236.9453 1262.6014 1320.4436 1305.6711 1339.2879 1766.027 100
# tstrsplit: 385.4785 452.6476 498.8681 470.8281 537.5499 1044.691 100
Answer 1 (score: 2)
We can use sub
d1[, Track2 := sub("\\S+[[:punct:] ]+(\\S+).*", "\\1", MENU_HINT)]
or gsub
d1[, Track2 := gsub("^[^/]+/\\s*|\\s+.*$", "", MENU_HINT)]
d1
# MENU_HINT Track Track2
#1: GB / Ling 31st Dec Ling Ling
#2: GB / Taun 30th Dec Taun Taun
#3: GB / Ayr 19th Dec Ayr Ayr
#4: GB / Ayr 9th Nov Ayr Ayr
#5: GB / ChelmC 29th Sep ChelmC ChelmC
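For readers who want to see what each pattern matches, the sketch below runs the same two expressions on a plain character vector and annotates the pieces. The variable names `menu_hint`, `via_sub`, and `via_gsub` are made up for illustration.

```r
# Illustrative input; a subset of the MENU_HINT values from the question
menu_hint <- c("GB / Ling 31st Dec", "GB / Ayr 9th Nov", "GB / ChelmC 29th Sep")

# sub: match from the first word to the end and keep only the captured group
#   \\S+            first run of non-space characters ("GB")
#   [[:punct:] ]+   punctuation and spaces (" / ")
#   (\\S+)          next run of non-space characters, captured as \\1
#   .*              everything after it
via_sub <- sub("\\S+[[:punct:] ]+(\\S+).*", "\\1", menu_hint)

# gsub: delete two pieces and keep what is left in between
#   ^[^/]+/\\s*     everything up to and including the slash, plus spaces
#   \\s+.*$         the first remaining space and everything after it
via_gsub <- gsub("^[^/]+/\\s*|\\s+.*$", "", menu_hint)

print(via_sub)
print(via_gsub)
```

On this sample both expressions return the same values as the Track column.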