This is my tibble:
protein patient value
<chr> <chr> <dbl>
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71
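In the patient column, the "d" in e.g. "Case-x-d" stands for days. What I want to do is create a new column stating whether the string in the patient column contains a value of less than 14d.
I have managed to do this with the following commands: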
under14 <- "-1d|-2d|-3d|-4d|-5d|-6d|-7d|-8d|-9d|-10d|-11d|-12d|-13d|-14d"
data <- data %>%
  mutate(case = ifelse(grepl(under14, patient), 'under14days', 'over14days'))
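However, this seems very clunky and took a long time to type. I will have to change the search term many times, so is there a quicker way? Perhaps some kind of regular expression would be the best option, but I really don't know where to start.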
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readxl_1.1.0 Rmisc_1.5 plyr_1.8.4 lattice_0.20-35 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.5 purrr_0.2.5
[9] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_2.2.1 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 cellranger_1.1.0 pillar_1.2.3 compiler_3.5.0 bindr_0.1.1 tools_3.5.0 lubridate_1.7.4
[8] jsonlite_1.5 nlme_3.1-137 gtable_0.2.0 pkgconfig_2.0.1 rlang_0.2.1 psych_1.8.4 cli_1.0.0
[15] rstudioapi_0.7 yaml_2.1.19 parallel_3.5.0 haven_1.1.1 bindrcpp_0.2.2 xml2_1.2.0 httr_1.3.1
[22] hms_0.4.2 grid_3.5.0 tidyselect_0.2.4 glue_1.2.0 R6_2.2.2 foreign_0.8-70 modelr_0.1.2
[29] reshape2_1.4.3 magrittr_1.5 scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2
[36] utf8_1.1.4 stringi_1.2.3 lazyeval_0.2.1 munsell_0.5.0 broom_0.4.4 crayon_1.3.4
Answer 0 (score: 4)
One possibility is to use tidyr::separate:
library(tidyverse)
df %>%
  separate(patient, into = c("ID1", "Days", "ID2"), sep = "-", extra = "merge", remove = FALSE) %>%
  mutate(case = ifelse(as.numeric(Days) <= 14, "under14days", "over14days")) %>%
  select(-ID1, -ID2)
# protein patient Days value case
#1 BOD1L2 RF0064_Case-9-d- 9 10.40 under14days
#2 PPFIA2 RF0064_Case-20-d- 20 7.83 over14days
#3 STAT4 RF0064_Case-11-d- 11 11.00 under14days
#4 TOM1L2 RF0064_Case-29-d- 29 13.00 over14days
#5 SH2D2A RF0064_Case-2-d- 2 8.28 under14days
#6 TIGD4 RF0064_Case-49-d- 49 9.71 over14days
Data:
df <- read.table(text =
" protein patient value
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71 ", header = TRUE, row.names = 1)
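To see why the day count ends up in the second piece when splitting on "-", the split can be reproduced in base R. This is only an illustrative sketch: the patients vector is a small subset copied from the question's data, and strsplit stands in for tidyr::separate, which it only approximates (it has no equivalent of extra = "merge" and simply drops the trailing empty piece).

```r
# Illustrative subset of the patient strings from the question
patients <- c("RF0064_Case-9-d-", "RF0064_Case-20-d-", "RF0064_Case-11-d-")

# Split each string on the literal "-" character
pieces <- strsplit(patients, "-", fixed = TRUE)

# The first string splits into "RF0064_Case", "9" and "d"
pieces[[1]]

# The day count is the second piece of every split
days <- as.numeric(vapply(pieces, function(p) p[2], character(1)))
days
```

If patient was read in as a factor, it would need as.character() before strsplit.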
Answer 1 (score: 4)
Since the format of patient is well defined, a possible solution in base R is to use gsub to extract the days and then check the range:
df$case <- ifelse(as.integer(gsub("RF0064_Case-(\\d+)-d-", "\\1", df$patient)) <= 14,
                  "under14days", "over14days")
Equivalently, the OP can modify the code used in mutate to:
library(dplyr)
df <- df %>%
  mutate(case = ifelse(as.integer(gsub("RF0064_Case-(\\d+)-d-", "\\1", patient)) <= 14,
                       "under14days", "over14days"))
df
# protein patient value case
# 1 BOD1L2 RF0064_Case-9-d- 10.40 under14days
# 2 PPFIA2 RF0064_Case-20-d- 7.83 over14days
# 3 STAT4 RF0064_Case-11-d- 11.00 under14days
# 4 TOM1L2 RF0064_Case-29-d- 13.00 over14days
# 5 SH2D2A RF0064_Case-2-d- 8.28 under14days
# 6 TIGD4 RF0064_Case-49-d- 9.71 over14days
Data:
df <- read.table(text =
"protein patient value
1 BOD1L2 RF0064_Case-9-d- 10.4
2 PPFIA2 RF0064_Case-20-d- 7.83
3 STAT4 RF0064_Case-11-d- 11.0
4 TOM1L2 RF0064_Case-29-d- 13.0
5 SH2D2A RF0064_Case-2-d- 8.28
6 TIGD4 RF0064_Case-49-d- 9.71",
header = TRUE, stringsAsFactors = FALSE)
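Because the question mentions that the cutoff will change many times, the same capture-group idea can be wrapped in a small function with the cutoff as an argument. The following is a minimal sketch, not part of the original answer: label_by_days and its cutoff parameter are hypothetical names, and the pattern assumes every string contains "Case-<digits>-d-" as in the sample data.

```r
# Hypothetical helper: pull the day count out of strings such as
# "RF0064_Case-9-d-" and label each one relative to a cutoff.
label_by_days <- function(patient, cutoff = 14) {
  # Keep only the digits captured between "Case-" and "-d-".
  # Strings that do not match are returned unchanged by sub() and become NA.
  days <- as.integer(sub("^.*Case-(\\d+)-d-.*$", "\\1", patient))
  # As in the answers, "under" includes the cutoff itself (<=)
  ifelse(days <= cutoff,
         paste0("under", cutoff, "days"),
         paste0("over", cutoff, "days"))
}

# Illustrative input based on the question's format
patients <- c("RF0064_Case-9-d-", "RF0064_Case-20-d-", "RF0064_Case-14-d-")
default_labels <- label_by_days(patients)
week_labels <- label_by_days(patients, cutoff = 7)
```

Inside a pipeline this could be used as mutate(case = label_by_days(patient, cutoff = 14)), so changing the threshold only means changing one number.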
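Answer 2 (score: 3)
We can also use a regular expression to extract the number directly. (?<=-) is a lookbehind: it matches a position that is immediately preceded by "-", so the pattern picks up the digits that follow the first hyphen.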
library(tidyverse)
dat2 <- dat %>%
  mutate(Day = as.numeric(str_extract(patient, pattern = "(?<=-)[0-9]*"))) %>%
  mutate(case = ifelse(Day <= 14, 'under14days', 'over14days'))
dat2
# protein patient value Day case
# 1 BOD1L2 RF0064_Case-9-d- 10.40 9 under14days
# 2 PPFIA2 RF0064_Case-20-d- 7.83 20 over14days
# 3 STAT4 RF0064_Case-11-d- 11.00 11 under14days
# 4 TOM1L2 RF0064_Case-29-d- 13.00 29 over14days
# 5 SH2D2A RF0064_Case-2-d- 8.28 2 under14days
# 6 TIGD4 RF0064_Case-49-d- 9.71 49 over14days
Data:
dat <- read.table(text = " protein patient value
1 BOD1L2 'RF0064_Case-9-d-' 10.4
2 PPFIA2 'RF0064_Case-20-d-' 7.83
3 STAT4 'RF0064_Case-11-d-' 11.0
4 TOM1L2 'RF0064_Case-29-d-' 13.0
5 SH2D2A 'RF0064_Case-2-d-' 8.28
6 TIGD4 'RF0064_Case-49-d-' 9.71",
header = TRUE, stringsAsFactors = FALSE)
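The same lookbehind can be tried without stringr. The sketch below uses base R's regexpr and regmatches and is only an illustration under two assumptions: the patients vector is a made-up subset of the sample data, and the quantifier is changed from * to + so that an empty match is never returned.

```r
# Illustrative subset of the patient strings
patients <- c("RF0064_Case-9-d-", "RF0064_Case-20-d-")

# perl = TRUE is required for lookbehind support in base R regular expressions
m <- regexpr("(?<=-)[0-9]+", patients, perl = TRUE)

# regmatches() returns the first match per string; strings without a match
# are dropped, so this assumes every string contains "-<digits>"
days <- as.numeric(regmatches(patients, m))
days
```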