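How can I use grep to find a partial string match and return only a portion of that string into a new column in an existing data table?
For example, I have a column in my data table, dt$remarks: "FC3"
Some of the remarks read like "blah blah blah 57 DAYS LATE blah blah."
Is there a statement I can use to grab the '57 DAYS LATE' part and put it in a new column? Of course, it is not always 57 days; sometimes it is 145 days, sometimes only 8, so the length of the string varies.
As requested, here is sample/reproducible data (I think this is what you asked for):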
7548 1D10000 2016 2016 CAT 1 WAS SUBMITTED 9 DAYS LATE
3647 1D10001 2011 PENALTY PAID
3547 1D39949 2013 2013 CAT 1 WAS 57 DAYS LATE SUBMIT
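Here, the column containing the string "2016 CAT 1 WAS SUBMITTED 9 DAYS LATE" and, in its own row, the string "2013 CAT 1 WAS 57 DAYS LATE SUBMIT" is the one I mean. Ideally, I would like to search for, extract, and place the "9 DAYS LATE" or "57 DAYS LATE" string in a new column.
The name of the column containing the strings I want is FC4$remarks
Thanks, I hope this clarifies things!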
Answer 0 (score: 0)
You can use gsub() with a capture group, as follows:
library(data.table)

dt <- data.table(remarks = c("blah blah blah 57 DAYS LATE blah blah",
                             "blah blah blah 145 DAYS LATE blah blah",
                             "2013 CAT 1 WAS 123 DAYS LATE SUBMIT",
                             "2016 CAT 1 WAS SUBMITTED 9 DAYS LATE"))
# Capture one or more consecutive digits followed by " DAYS LATE" and
# replace the whole string with that captured group.
# Note: a remark that does not match the pattern is returned unchanged.
dt$new_column <- gsub(".* (\\d+ DAYS LATE).*", "\\1", dt$remarks)
dt
remarks new_column
1: blah blah blah 57 DAYS LATE blah blah 57 DAYS LATE
2: blah blah blah 145 DAYS LATE blah blah 145 DAYS LATE
3: 2013 CAT 1 WAS 123 DAYS LATE SUBMIT 123 DAYS LATE
4: 2016 CAT 1 WAS SUBMITTED 9 DAYS LATE 9 DAYS LATE
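The sample data in the question also has a row ("PENALTY PAID") with no "DAYS LATE" text, and gsub() leaves such a remark unchanged rather than blank. As a rough sketch of one way to handle that, assuming you would prefer NA for those rows, base R's regexpr() and regmatches() can pull out just the matched fragment. The helper name extract_days_late is hypothetical and not part of the answer above.

```r
# Hypothetical helper (an illustration, not from the original answer):
# returns the "<n> DAYS LATE" fragment, or NA when a remark has none.
extract_days_late <- function(x) {
  out <- rep(NA_character_, length(x))
  m <- regexpr("\\d+ DAYS LATE", x)   # start of first match, -1 if no match
  hit <- !is.na(m) & m != -1
  out[hit] <- regmatches(x, m)        # regmatches() returns matched elements only
  out
}

remarks <- c("2016 CAT 1 WAS SUBMITTED 9 DAYS LATE",
             "PENALTY PAID",
             "2013 CAT 1 WAS 57 DAYS LATE SUBMIT")

late_text <- extract_days_late(remarks)
# Optional: keep only the number of days as an integer.
late_days <- as.integer(sub(" DAYS LATE", "", late_text))

print(late_text)
print(late_days)
```

With a data.table, the same helper could presumably be applied as dt[, new_column := extract_days_late(remarks)], which adds the column by reference.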
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.6 magrittr_1.5 dplyr_0.5.0 purrr_0.2.2 readr_1.0.0 tidyr_0.6.1 tibble_1.2
[8] tidyverse_1.0.0 ggmap_2.7 ggplot2_2.2.1