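I have a data frame with several text columns. I need to output one value per row, depending on whether a particular string is present. So, if my input is: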
Kingdom | Phylum | Class | Order |
-------------------------------------------------------------------------
Bacteria | Firmicutes | Negativicutes | Selenomonadales |
Bacteria | Bact_unclassified | Bact_unclassified | Bact_unclassified |
Bacteria | Firmicutes | Negativicutes | Negativ_unclassified |
Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales |
Archaea | Euryarchaeota | Eury_unclassified | Eury_unclassified |
I would like my output to look like:
Output |
-----------------------
o_Selenomonadales |
k_Bacteria |
c_Negativicutes |
o_Methanobacteriales |
p_Euryarchaeota |
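The prefix in each output row comes from the column name: "k" from Kingdom, "p" from Phylum, "c" from Class and "o" from Order. Note that the key string used for filtering is "_unclassified". Any ideas?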
Answer 0 (score: 1)
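Using apply row-wise lets you inspect the data one row at a time. Find the first column that contains _unclassified and subtract 1 from its index to get the preceding column, which holds the value wanted for that row. If no column matches, the value from the last column is returned. substring then extracts the first character of that column's name, which is used as the prefix of the returned value.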
df$Output <- apply(df, 1, function(x){
  idx <- length(x)  # By default, the value from the last column is returned
  if(length(which(grepl("_unclassified", x))) > 0 ){
    idx <- min(which(grepl("_unclassified", x))) - 1
  }
  paste(toupper(substring(names(df)[idx], 1, 1)), trimws(x[idx]), sep = "_")
})
# Check the result
df["Output"]
# Output
# 1 O_Selenomonadales
# 2 K_Bacteria
# 3 C_Negativicutes
# 4 O_Methanobacteriales
# 5 P_Euryarchaeota
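The expected output in the question uses lowercase prefixes, while the code above produces uppercase ones. A minimal variant that lowercases the prefix might look like the sketch below. The helper name `deepest_classified` is hypothetical, the small data frame is a trimmed stand-in for the question's data, and falling back to the first column when even Kingdom is unclassified is an assumption, since the question does not cover that case.

```r
# Small stand-in for the question's data (values already trimmed).
df <- data.frame(
  Kingdom = c("Bacteria", "Bacteria", "Archaea"),
  Phylum  = c("Firmicutes", "Bact_unclassified", "Euryarchaeota"),
  Class   = c("Negativicutes", "Bact_unclassified", "Eury_unclassified"),
  Order   = c("Selenomonadales", "Bact_unclassified", "Eury_unclassified"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label each row with its deepest classified rank.
deepest_classified <- function(data, pattern = "_unclassified") {
  ranks <- names(data)
  apply(data, 1, function(x) {
    hits <- grep(pattern, x, fixed = TRUE)
    # No match: use the last column. Otherwise step back one column,
    # but never below the first (assumption: fall back to Kingdom).
    idx <- if (length(hits) == 0) length(x) else max(min(hits) - 1, 1)
    paste(tolower(substring(ranks[idx], 1, 1)), trimws(x[idx]), sep = "_")
  })
}

df$Output <- deepest_classified(df)
```

Computing the result before assigning it to `df$Output` keeps the helper from seeing its own output column on a second run.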
Data:
df <- read.table(text =
"Kingdom | Phylum | Class | Order
Bacteria | Firmicutes | Negativicutes | Selenomonadales
Bacteria | Bact_unclassified | Bact_unclassified | Bact_unclassified
Bacteria | Firmicutes | Negativicutes | Negativ_unclassified
Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales
Archaea | Euryarchaeota | Eury_unclassified | Eury_unclassified",
sep = "|", header = TRUE, stringsAsFactors = FALSE)
Answer 1 (score: 0)
library(tidyverse)
dat <- read_delim(
"Kingdom | Phylum | Class | Order
Bacteria | Firmicutes | Negativicutes | Selenomonadales
Bacteria | Bact_unclassified | Bact_unclassified | Bact_unclassified
Bacteria | Firmicutes | Negativicutes | Negativ_unclassified
Archaea | Euryarchaeota | Methanobacteria | Methanobacteriales
Archaea | Euryarchaeota | Eury_unclassified | Eury_unclassified
", delim = "|", trim_ws = TRUE)
priority <- c(Order = 1L, Class = 2L, Phylum = 3L, Kingdom = 4L)
dat %>%
  mutate(id = row_number()) %>%
  gather(variable, value, -id) %>%
  mutate(priority = priority[variable]) %>%
  arrange(id, priority) %>%
  group_by(id) %>%
  slice(detect_index(value, Negate(grepl), pattern = "unclassified$")) %>%
  mutate(Output = paste(tolower(substr(variable, 1, 1)), value, sep = "_"))
# # A tibble: 5 x 5
# # Groups: id [5]
# id variable value priority Output
# <int> <chr> <chr> <int> <chr>
# 1 1 Order Selenomonadales 1 o_Selenomonadales
# 2 2 Kingdom Bacteria 4 k_Bacteria
# 3 3 Class Negativicutes 2 c_Negativicutes
# 4 4 Order Methanobacteriales 1 o_Methanobacteriales
# 5 5 Phylum Euryarchaeota 3 p_Euryarchaeota
Another way, using coalesce:
dat %>%
  imap_dfc(~ paste(tolower(substr(.y, 1, 1)), .x, sep = "_")) %>%
  mutate_all(function(x) ifelse(grepl("unclassified$", x), NA, x)) %>%
  mutate(Output = coalesce(Order, Class, Phylum, Kingdom))
# # A tibble: 5 x 5
# Kingdom Phylum Class Order Output
# <chr> <chr> <chr> <chr> <chr>
# 1 k_Bacteria p_Firmicutes c_Negativicutes o_Selenomonadales o_Selenomonadales
# 2 k_Bacteria NA NA NA k_Bacteria
# 3 k_Bacteria p_Firmicutes c_Negativicutes NA c_Negativicutes
# 4 k_Archaea p_Euryarchaeota c_Methanobacteria o_Methanobacteriales o_Methanobacteriales
# 5 k_Archaea p_Euryarchaeota NA NA p_Euryarchaeota
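To see why the argument order in coalesce matters here, a small standalone sketch may help. The three vectors below are made-up stand-ins for the prefixed columns (they are not taken from `dat`), with NA marking an unclassified rank.

```r
library(dplyr)

# Candidate labels, deepest rank first; NA marks an unclassified rank.
order_lbl   <- c("o_Selenomonadales", NA, NA)
class_lbl   <- c("c_Negativicutes", NA, "c_Negativicutes")
kingdom_lbl <- c("k_Bacteria", "k_Bacteria", "k_Bacteria")

# For each position, coalesce() returns the first non-NA value
# among its arguments, taken in the order they are passed.
result <- coalesce(order_lbl, class_lbl, kingdom_lbl)
```

Passing the columns from the deepest rank to the shallowest is what makes the deepest classified rank win for each row.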