Populating variable values in an R data frame based on string matching

Time: 2017-03-21 01:18:47

Tags: r merge

I have the following data frame in R:

df <- data.frame(Sample_name = c("01_00H_NA_DNA",   "01_00H_NA_RNA",    "01_00H_NA_S",  "01_00H_NW_DNA",    "01_00H_NW_RNA",    "01_00H_NW_S",  "01_00H_OM_DNA",    "01_00H_OM_RNA",    "01_00H_OM_S",  "01_00H_RL_DNA",    "01_00H_RL_RNA",    "01_00H_RL_S"),
             Pair = c("","", "S1","","","S2","","","S3","", "","S5"))

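I want to generate a new variable, Label, so that rows whose Sample_name is identical up to the last underscore (that is, everything before the trailing DNA, RNA, or S) get the same label ID number. Not every row necessarily starts with 01_00H, but there will always be a shared string before the last underscore that determines the label.

In addition, I want to fill in the Pair variable with matching values, so that all rows with the same label get S1, and so on. The existing Pair values are not consecutive: S3 is followed by S5, for example.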

The resulting data frame will look something like this:

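I have found this very hard to do. I followed "How to create new column in dataframe based on partial string matching other column in R", but it only helped me with direct 1:1 renaming.

Any solutions from useRs would be greatly appreciated, thanks!

1 Answer:

Answer 0: (score: 1)

Try this: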

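# Strip the last underscore and everything after it to get the shared stem
df$x <- gsub("_[^_]+$", "", df$Sample_name)
# Number the stems in order of first appearance
df$Label <- match(df$x, unique(df$x))
# Within each Label, the non-empty Pair value sorts above "", so max() picks it
df$Pair <- ave(as.character(df$Pair), df$Label, FUN=max)
# Drop the helper column
df$x <- NULL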
df
#      Sample_name Pair Label
# 1  01_00H_NA_DNA   S1     1
# 2  01_00H_NA_RNA   S1     1
# 3    01_00H_NA_S   S1     1
# 4  01_00H_NW_DNA   S2     2
# 5  01_00H_NW_RNA   S2     2
# 6    01_00H_NW_S   S2     2
# 7  01_00H_OM_DNA   S3     3
# 8  01_00H_OM_RNA   S3     3
# 9    01_00H_OM_S   S3     3
# 10 01_00H_RL_DNA   S5     4
# 11 01_00H_RL_RNA   S5     4
# 12   01_00H_RL_S   S5     4
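
Since the question mentions that not every row starts with 01_00H, here is a minimal sketch of what the regular expression does on names with different prefixes. The sample names in `nm` are made up for illustration and are not from the question's data.

```r
# Hypothetical sample names with two different prefixes (illustrative only)
nm <- c("01_00H_NA_DNA", "01_00H_NA_S", "02_24H_XY_RNA", "02_24H_XY_S")

# "_[^_]+$" matches the last underscore plus everything after it,
# so removing it leaves the part shared by DNA/RNA/S rows
stem <- sub("_[^_]+$", "", nm)

# match() against the unique stems numbers groups by first appearance
label <- match(stem, unique(stem))

print(stem)
print(label)
```

Because the pattern is anchored at the end of the string, the prefix can be anything, as long as the DNA/RNA/S suffix follows the last underscore.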

Or using dplyr:

library(dplyr)
df %>%
  mutate(
    x = gsub("_[^_]+$", "", Sample_name),
    Label = match(x, unique(x))
  ) %>%
  select(-x) %>%
  group_by(Label) %>%
  mutate(Pair = paste0(Pair, collapse = "")) %>%
  ungroup()
# # A tibble: 12 × 3
#      Sample_name  Pair Label
#           <fctr> <chr> <int>
# 1  01_00H_NA_DNA    S1     1
# 2  01_00H_NA_RNA    S1     1
# 3    01_00H_NA_S    S1     1
# 4  01_00H_NW_DNA    S2     2
# 5  01_00H_NW_RNA    S2     2
# 6    01_00H_NW_S    S2     2
# 7  01_00H_OM_DNA    S3     3
# 8  01_00H_OM_RNA    S3     3
# 9    01_00H_OM_S    S3     3
# 10 01_00H_RL_DNA    S5     4
# 11 01_00H_RL_RNA    S5     4
# 12   01_00H_RL_S    S5     4
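
Note that `paste0(Pair, collapse = "")` only gives a clean result when each group has exactly one non-empty Pair value; if a group had two, they would be concatenated. A possible variant, sketched here on a small made-up data frame (`df2` and its contents are assumptions, not the question's data), takes the first non-empty value per group instead:

```r
library(dplyr)

# Small illustrative data frame (hypothetical, mirrors the question's layout)
df2 <- data.frame(
  Sample_name = c("01_00H_NA_DNA", "01_00H_NA_S", "01_00H_NW_DNA", "01_00H_NW_S"),
  Pair = c("", "S1", "", "S2"),
  stringsAsFactors = FALSE
)

out <- df2 %>%
  mutate(
    stem = sub("_[^_]+$", "", Sample_name),   # shared part before the last "_"
    Label = match(stem, unique(stem))         # group number by first appearance
  ) %>%
  group_by(Label) %>%
  # First non-empty Pair in the group; NA if the group has none
  mutate(Pair = Pair[Pair != ""][1]) %>%
  ungroup() %>%
  select(-stem)

print(out)
```

With this version, a group that has no Pair value at all ends up with NA rather than an empty string, which may or may not be what you want.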

Edit: added @thelatemail's use of ave, which is better for both code golf and robustness.