I have the following data frame in R:
df <- data.frame(Sample_name = c("01_00H_NA_DNA", "01_00H_NA_RNA", "01_00H_NA_S", "01_00H_NW_DNA", "01_00H_NW_RNA", "01_00H_NW_S", "01_00H_OM_DNA", "01_00H_OM_RNA", "01_00H_OM_S", "01_00H_RL_DNA", "01_00H_RL_RNA", "01_00H_RL_S"),
Pair = c("","", "S1","","","S2","","","S3","", "","S5"))
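I want to create a new variable, Label, so that rows whose Sample_name strings match up to the last underscore (that is, everything before the DNA/RNA or S suffix) get the same label ID number. Not every row necessarily starts with 01_00H, but rows that belong together will always share the same string before the last underscore.
In addition, I want to fill the Pair variable with matching values, so that every row with the same label gets S1, and so on. The existing Pair values are not consecutive: S3 is followed by S5, for example.
The resulting data frame should look like this:
I am finding this very hard to do. I followed "How to create new column in dataframe based on partial string matching other column in R", but it only helped me with direct 1:1 renaming.
Any solution from useRs would be greatly appreciated. Thanks!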
Answer 0 (score: 1)
Try this:
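# Drop the last underscore and the suffix after it (DNA, RNA or S)
df$x <- gsub("_[^_]+$", "", df$Sample_name)
# Number each distinct prefix in order of first appearance
df$Label <- match(df$x, unique(df$x))
# Within each label, spread the single non-empty Pair value to all rows
df$Pair <- ave(as.character(df$Pair), df$Label, FUN=max)
df$x <- NULL
df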
# Sample_name Pair Label
# 1 01_00H_NA_DNA S1 1
# 2 01_00H_NA_RNA S1 1
# 3 01_00H_NA_S S1 1
# 4 01_00H_NW_DNA S2 2
# 5 01_00H_NW_RNA S2 2
# 6 01_00H_NW_S S2 2
# 7 01_00H_OM_DNA S3 3
# 8 01_00H_OM_RNA S3 3
# 9 01_00H_OM_S S3 3
# 10 01_00H_RL_DNA S5 4
# 11 01_00H_RL_RNA S5 4
# 12 01_00H_RL_S S5 4
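To see what each step contributes, the intermediate values can be inspected on a smaller input. The following is only a sketch on a hypothetical three-element vector (`s`, `p`, `key` and `label` are illustrative names, not part of the question's data). It relies on the empty string sorting before any non-empty string, which is why `max()` picks the filled-in value.

```r
# Hypothetical mini example (not the full data from the question)
s <- c("01_00H_NA_DNA", "01_00H_NA_S", "02_12H_NW_RNA")
p <- c("", "S1", "S2")

# Strip the last underscore and everything after it
key <- gsub("_[^_]+$", "", s)
key
# [1] "01_00H_NA" "01_00H_NA" "02_12H_NW"

# Number the keys in order of first appearance
label <- match(key, unique(key))
label
# [1] 1 1 2

# "" sorts before any non-empty string, so max() returns the filled value
filled <- ave(p, label, FUN = max)
filled
# [1] "S1" "S1" "S2"
```

Note that the prefix does not have to start with 01_00H; only the part before the last underscore matters.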
Or using dplyr:
library(dplyr)
df %>%
  mutate(
    x = gsub("_[^_]+$", "", Sample_name),
    Label = match(x, unique(x))
  ) %>%
  select(-x) %>%
  group_by(Label) %>%
  mutate(Pair = paste0(Pair, collapse = "")) %>%
  ungroup()
# # A tibble: 12 × 3
# Sample_name Pair Label
# <fctr> <chr> <int>
# 1 01_00H_NA_DNA S1 1
# 2 01_00H_NA_RNA S1 1
# 3 01_00H_NA_S S1 1
# 4 01_00H_NW_DNA S2 2
# 5 01_00H_NW_RNA S2 2
# 6 01_00H_NW_S S2 2
# 7 01_00H_OM_DNA S3 3
# 8 01_00H_OM_RNA S3 3
# 9 01_00H_OM_S S3 3
# 10 01_00H_RL_DNA S5 4
# 11 01_00H_RL_RNA S5 4
# 12 01_00H_RL_S S5 4
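One caveat with the dplyr version: `paste0(Pair, collapse = "")` concatenates every value in the group, so it gives the expected result only when exactly one row per label has a non-empty Pair, as in the question's data. The sketch below uses hypothetical data (`d`, `pasted` and `filled` are illustrative names) to compare it with `max()`, which is what the base R version uses.

```r
library(dplyr)

# Hypothetical data: label 1 has two filled-in Pair values
d <- data.frame(Label = c(1, 1, 1, 2, 2),
                Pair  = c("S1", "", "S1", "", "S2"),
                stringsAsFactors = FALSE)

res <- d %>%
  group_by(Label) %>%
  mutate(pasted = paste0(Pair, collapse = ""),  # concatenates every value
         filled = max(Pair)) %>%                # keeps one non-empty value
  ungroup()

res$pasted
# [1] "S1S1" "S1S1" "S1S1" "S2"   "S2"
res$filled
# [1] "S1" "S1" "S1" "S2" "S2"
```

If a label could carry its Pair value on more than one row, swapping `max(Pair)` into the pipeline above would be the safer choice.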
Edit: added @thelatemail's use of ave, which is better both for code golf and for robustness.
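A note on the `as.character()` call in the base R version: the `<fctr>` column in the printed tibble suggests the code was run on a version of R where `data.frame()` converted strings to factors by default (this changed in R 4.0.0). Assuming that is the reason for the conversion, the sketch below shows why it matters, using a hypothetical factor `pair_f`.

```r
# Simulate the pre-4.0 default, where strings became factors
pair_f <- factor(c("", "", "S1"))

# max() is not defined for unordered factors, so this raises an error
res <- tryCatch(max(pair_f), error = function(e) "error")
res
# [1] "error"

# Converting to character first makes the comparison well defined
max(as.character(pair_f))
# [1] "S1"
```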