Using grep on a partial string (with numbers) to create a column in a data.table

Time: 2017-03-29 18:39:10

Tags: r regex string grep data.table

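How can I use grep to find a partial string match and return only that part of the string into a new column of an existing data.table?
For example, I have a column in a dt: FC3$remarks
Some of the remarks say something like "blah blah blah 57 DAYS LATE blah blah." Is there a statement I can use to grab just the "57 DAYS LATE" part and put it in a new column? Of course, it is not always 57 days; sometimes it is 145 days and sometimes only 8, so the length of the string is dynamic.

Per request: here is some sample/reproducible data (I think this is what you asked for)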

7548    1D10000 2016    2016 CAT 1 WAS SUBMITTED 9 DAYS LATE  
3647    1D10001 2011    PENALTY PAID   
3547    1D39949 2013    2013 CAT 1 WAS 57 DAYS LATE SUBMIT  

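Here, the column containing the strings "2016 CAT 1 WAS SUBMITTED 9 DAYS LATE" and "2013 CAT 1 WAS 57 DAYS LATE SUBMIT" in their respective rows is the one I am referring to. What would be the best way to search for, grab, and place the "9 DAYS LATE" or "57 DAYS LATE" string in a new column?

The name of the column containing the strings I want is FC4$remarks

Thanks, I hope this clarifies things!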

1 Answer:

Answer 0: (score: 0)

You can use gsub() with a capture group, like this:

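library(data.table)

dt <- data.table(remarks = c("blah blah blah 57 DAYS LATE blah blah",
                             "blah blah blah 145 DAYS LATE blah blah",
                             "2013 CAT 1 WAS 123 DAYS LATE SUBMIT",
                             "2016 CAT 1 WAS SUBMITTED 9 DAYS LATE"))

# The capture group matches one or more consecutive digits followed by " DAYS LATE";
# "\\1" replaces the whole remark with just that captured text.
# Note: gsub() returns a remark unchanged when the pattern does not match it.
dt$new_column <- gsub(".* (\\d+ DAYS LATE).*", "\\1", dt$remarks)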

dt
                                  remarks    new_column
1:  blah blah blah 57 DAYS LATE blah blah  57 DAYS LATE
2: blah blah blah 145 DAYS LATE blah blah 145 DAYS LATE
3:    2013 CAT 1 WAS 123 DAYS LATE SUBMIT 123 DAYS LATE
4:   2016 CAT 1 WAS SUBMITTED 9 DAYS LATE   9 DAYS LATE

sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.9.6 magrittr_1.5     dplyr_0.5.0      purrr_0.2.2      readr_1.0.0      tidyr_0.6.1      tibble_1.2      
 [8] tidyverse_1.0.0  ggmap_2.7        ggplot2_2.2.1
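
The sample data in the question also contains a remark with no "DAYS LATE" text ("PENALTY PAID"), and gsub() would copy that remark into the new column unchanged. Below is a minimal sketch of one way to get NA for such rows instead, using base R's grepl(), regexpr() and regmatches() together with data.table's := assignment. The column names days_late_text and days_late are hypothetical, chosen only for illustration, and the numeric conversion is an optional extra that the question did not ask for.

```r
library(data.table)

# Remarks taken from the question's sample data, including one row without a match
dt <- data.table(remarks = c("2016 CAT 1 WAS SUBMITTED 9 DAYS LATE",
                             "PENALTY PAID",
                             "2013 CAT 1 WAS 57 DAYS LATE SUBMIT"))

pattern <- "\\d+ DAYS LATE"

# TRUE for rows whose remark contains the pattern
matched <- grepl(pattern, dt$remarks)

# Start with NA everywhere, then fill in only the rows that matched.
# days_late_text is a hypothetical column name.
dt[, days_late_text := NA_character_]
dt[matched, days_late_text := regmatches(remarks, regexpr(pattern, remarks))]

# Optional: keep the day count as an integer (hypothetical column days_late)
dt[, days_late := as.integer(sub(" DAYS LATE", "", days_late_text, fixed = TRUE))]

print(dt)
```

Because the pattern here is not anchored with a leading ".* ", it would also match a remark that begins directly with the day count.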