I have the following dummy data frame:
structure(list(ref = structure(1:7, .Label = c("a", "b", "c",
"d", "e", "f", "g"), class = "factor"), gene = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L), .Label = c("gyrA", "parC"), class = "factor"),
result = structure(c(2L, 4L, 6L, 2L, 3L, 5L, 1L), .Label = c("S479T",
"S83L", "S83L, D678E, D741E", "S83L, D87G", "T765E", "V196A, M248V, E678D"
), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
It looks like this:
ref gene result
a gyrA S83L
b gyrA S83L, D87G
c gyrA V196A, M248V, E678D
d gyrA S83L
e gyrA S83L, D678E, D741E
f parC T765E
g parC S479T
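What I want to do is check whether the numbers in the result column (the digits between the two letters of each entry) fall within a specific range, namely 67-106, but only when the gene column == gyrA. Every number in each cell of the result column needs to be checked. If any number in a cell is within the specified range, result_pos should be 1. I tried the following: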
df %>%
mutate(gyrA_pos = ifelse(gene == "gyrA", gsub("[[:alpha:]]", "", result), NA),
result_pos = ifelse(gene == "gyrA" & gyrA_pos %in% as.character(seq(from = 67, to = 106)) == TRUE, 1, 0))
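This works, but only for entries that contain a single value. I also find it cumbersome to have to create a column with the letters stripped out before matching. This is what I would like to end up with: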
ref gene result result_pos
a gyrA S83L 1
b gyrA S83L, D87G 1
c gyrA V196A, M248V, E678D 0
d gyrA S83L 1
e gyrA S83L, D678E, D741E 1
f parC T765E NA
g parC S479T NA
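Answer 0 (score: 2)
Here is one way. You can use str_extract_all to get all of the numbers in result, not just the first one, and then use map and any to check whether any of them fall within the specified range. Finally, insert NA where needed and convert to integer.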
library(tidyverse)
df <- structure(list(ref = structure(1:7, .Label = c("a", "b", "c", "d", "e", "f", "g"), class = "factor"), gene = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("gyrA", "parC"), class = "factor"), result = structure(c(2L, 4L, 6L, 2L, 3L, 5L, 1L), .Label = c("S479T", "S83L", "S83L, D678E, D741E", "S83L, D87G", "T765E", "V196A, M248V, E678D"), class = "factor")), class = "data.frame", row.names = c(NA, -7L))
df %>%
mutate(
result_pos = result %>%
str_extract_all("\\d+") %>%
map(as.integer) %>%
map_lgl(~ any(.x >= 67L & .x <= 106L)),
result_pos = if_else(gene != "gyrA", NA, result_pos),
result_pos = as.integer(result_pos)
)
#> ref gene result result_pos
#> 1 a gyrA S83L 1
#> 2 b gyrA S83L, D87G 1
#> 3 c gyrA V196A, M248V, E678D 0
#> 4 d gyrA S83L 1
#> 5 e gyrA S83L, D678E, D741E 1
#> 6 f parC T765E NA
#> 7 g parC S479T NA
Created on 2018-09-04 by the reprex package (v0.2.0).
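For readers who would rather not load the tidyverse, the same extract-then-check idea can be sketched in base R. This is only an illustration under stated assumptions: in_range is a hypothetical helper name, and the short gene and result vectors below are stand-ins for the question's columns, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): TRUE when any number
# embedded in a string lies within [lo, hi].
in_range <- function(x, lo = 67L, hi = 106L) {
  x <- as.character(x)  # the question's columns are factors
  # One list element per string, holding every run of digits found in it
  nums <- regmatches(x, gregexpr("[0-9]+", x))
  vapply(
    nums,
    function(n) any(as.integer(n) >= lo & as.integer(n) <= hi),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Stand-in vectors mirroring a few rows of the question's data
gene   <- c("gyrA", "gyrA", "gyrA", "parC")
result <- c("S83L", "S83L, D87G", "V196A, M248V, E678D", "T765E")

# 1/0 for gyrA rows, NA everywhere else
result_pos <- ifelse(gene == "gyrA", as.integer(in_range(result)), NA_integer_)
result_pos
```

Because the helper works on a plain vector, it can also be called inside mutate() or a data.table := expression instead of the standalone ifelse() shown here.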
Answer 1 (score: 1)
Here is a data.table option.
library(data.table)
setDT(DF)  # DF is the data frame from the question, converted to a data.table by reference
DF[, `:=`(result = as.character(result), # coerce result to character
result_pos = NA_integer_)] # set result_pos to NA
DF[gene == 'gyrA', result_pos := {
x <-
lapply(strsplit(result, split = ","),
gsub,
pattern = "\\D+",
replacement = "")
as.integer(sapply(x, function(i)
any(as.numeric(i) >= 67 & as.numeric(i) <= 106)))
}][]
# ref gene result result_pos
#1: a gyrA S83L 1
#2: b gyrA S83L, D87G 1
#3: c gyrA V196A, M248V, E678D 0
#4: d gyrA S83L 1
#5: e gyrA S83L, D678E, D741E 1
#6: f parC T765E NA
#7: g parC S479T NA
The idea is to strsplit the result column, remove the letters, check the condition, and return an integer, only for rows where gene == 'gyrA'.
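To make the individual steps concrete, here is a small sketch that walks a single value through the same split, strip, and check sequence. The variable names x, parts, digits, and flag are illustrative only and do not appear in the answer above.

```r
x <- "S83L, D87G"

# Step 1: split the cell on commas
parts <- strsplit(x, split = ",")[[1]]

# Step 2: drop every non-digit character (letters and the leading space)
digits <- gsub("\\D+", "", parts)

# Step 3: flag the cell if any remaining number lies in 67-106
flag <- as.integer(any(as.numeric(digits) >= 67 & as.numeric(digits) <= 106))
flag
```

Here digits ends up as "83" and "87", both inside the range, so flag is 1.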