I have this data frame:
gene0 1 2 3
gene1 0 0 5
gene2 6 8 0
gene3 5 5 5
0 0 5
1 2 3
I would like to associate the numbers in the unnamed rows with a gene, so that the result looks like this:
gene0 1 2 3
gene1 0 0 5
gene2 6 8 0
gene3 5 5 5
gene1 0 0 5
gene0 1 2 3
What is the best way to do this? Do I need to use Linux or R for it?
Answer 0: (score: 2)
One dplyr and tidyr option could be:
# Group by every column except the first, then fill V1 downward within each group
df %>%
  group_by_at(-1) %>%
  fill(V1)
V1 V2 V3 V4
<chr> <int> <int> <int>
1 gene0 1 2 3
2 gene1 0 0 5
3 gene2 6 8 0
4 gene3 5 5 5
5 gene1 0 0 5
6 gene0 1 2 3
Or:
df %>%
group_by(group = group_indices(., !!!select(., -1))) %>%
fill(V1) %>%
ungroup() %>%
select(-group)
Sample data:
df <- read.table(text = "gene0 1 2 3
gene1 0 0 5
gene2 6 8 0
gene3 5 5 5
NA 0 0 5
NA 1 2 3",
header = FALSE,
na.strings = "NA",
stringsAsFactors = FALSE)
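For readers on a recent dplyr (1.0 or later), the same group-then-fill idea can be sketched with `across()`. This is only an illustration under assumptions, not the answerer's code: the name `filled` is made up here, and `.direction = "downup"` is added so that an unnamed row appearing before its named twin would also be filled.

```r
library(dplyr)
library(tidyr)

# Same sample data as above
df <- read.table(text = "gene0 1 2 3
gene1 0 0 5
gene2 6 8 0
gene3 5 5 5
NA 0 0 5
NA 1 2 3",
                 header = FALSE,
                 na.strings = "NA",
                 stringsAsFactors = FALSE)

# Rows with identical V2..V4 values land in the same group;
# within each group the known V1 is copied into the NA entries.
filled <- df %>%
  group_by(across(-V1)) %>%
  fill(V1, .direction = "downup") %>%
  ungroup()

filled
```

Grouping on the value columns is what ties an unnamed row to its gene, so this only works when the combination of values identifies a gene uniquely.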
Answer 1: (score: 0)
A naive solution:
library(tidyverse)
df <- tribble(~col1,~col2,~col3,
1,2,3,
0,0,5,
6,8,0,
5,5,5,
0,0,5,
1,2,3,
1,1,1)
df %>%
  mutate(gene = case_when(col1 == 1 & col2 == 2 & col3 == 3 ~ "gene0",
                          col1 == 0 & col2 == 0 & col3 == 5 ~ "gene1",
                          col1 == 6 & col2 == 8 & col3 == 0 ~ "gene2",
                          col1 == 5 & col2 == 5 & col3 == 5 ~ "gene3",
                          TRUE ~ "unknown_gene"))
Another, more scalable option is to create a table with the gene definitions (it could even be imported from Excel or similar):
df1 <- tribble(~gene,~col1,~col2,~col3,
'gene0',1,2,3,
'gene1',0,0,5,
'gene2',6,8,0,
'gene3',5,5,5)
and simply join it to the new observations:
df %>%
left_join(df1)
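The lookup-table join can be made a little more explicit. The following is a minimal sketch, assuming the column names used above; `gene_defs`, `obs`, and `annotated` are hypothetical names chosen for illustration. It spells out the join keys with `by` and labels rows that have no matching definition.

```r
library(dplyr)
library(tibble)

# Lookup table: one row per known gene (hypothetical name gene_defs)
gene_defs <- tribble(~gene,   ~col1, ~col2, ~col3,
                     "gene0", 1,     2,     3,
                     "gene1", 0,     0,     5,
                     "gene2", 6,     8,     0,
                     "gene3", 5,     5,     5)

# New observations without a gene label; the last row matches no definition
obs <- tribble(~col1, ~col2, ~col3,
               0,     0,     5,
               1,     2,     3,
               1,     1,     1)

# Join on the three value columns and label unmatched rows explicitly
annotated <- obs %>%
  left_join(gene_defs, by = c("col1", "col2", "col3")) %>%
  mutate(gene = coalesce(gene, "unknown_gene"))

annotated
```

Naming the keys in `by` avoids the message that `left_join()` prints when it has to guess the common columns, and it keeps the join stable if either table later gains extra columns.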
Answer 2: (score: 0)
We can use match from base R:
a1 <- do.call(paste, df1[-1])
df1$V1 <- df1$V1[match(a1, unique(a1))]
df1$V1
#[1] "gene0" "gene1" "gene2" "gene3" "gene1" "gene0"
Using the OP's dataset:
df1 <- read.csv("newest.csv", stringsAsFactors = FALSE)
df1$id[df1$id == ""] <- NA
a1 <- do.call(paste, df1[-1])
df1$id <- df1$id[match(a1, unique(a1))]
length(unique(df1$id))
#[1] 621
head(df1$id, 20)
#[1] "pop13_110" "pop1_2" "pop16_108" "pop2_10" "pop2_2" "pop2_3" "pop2_4" "pop2_5" "pop2_6" "pop2_7" "pop2_8"
#[12] "pop2_9" "pop2_10" "pop2_11" "pop7_81" "pop2_13" "pop2_15" "pop2_15" "pop2_16" "pop22_20"
tail(df1$id, 20)
# [1] "pop22_2" "pop22_3" "pop22_4" "pop22_5" "pop22_8" "pop22_9" "pop13_60" "pop16_131" "pop23_11" "pop22_25" "pop22"
#[12] "pop22_14" "pop22_15" "pop22_32" "pop22_28" "pop16_56" "pop22_18" "pop9_9" "pop22_21" "pop22_22"
df1 <- structure(list(V1 = c("gene0", "gene1", "gene2", "gene3", NA,
NA), V2 = c(1L, 0L, 6L, 5L, 0L, 1L), V3 = c(2L, 0L, 8L, 5L, 0L,
2L), V4 = c(3L, 5L, 0L, 5L, 5L, 3L)), class = "data.frame",
row.names = c(NA,
-6L))
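The `match(a1, unique(a1))` lookup above relies on the named rows coming first, each with a distinct combination of values. A slightly more defensive variant, given here only as a sketch under that observation (the names `key` and `known` are introduced for illustration), looks up each unnamed row among the rows whose name is known:

```r
df1 <- data.frame(
  V1 = c("gene0", "gene1", "gene2", "gene3", NA, NA),
  V2 = c(1L, 0L, 6L, 5L, 0L, 1L),
  V3 = c(2L, 0L, 8L, 5L, 0L, 2L),
  V4 = c(3L, 5L, 0L, 5L, 5L, 3L),
  stringsAsFactors = FALSE
)

# One string key per row, built from every column except V1
key <- do.call(paste, c(df1[-1], sep = "_"))

# Rows whose gene name is already known
known <- !is.na(df1$V1)

# For each unnamed row, find the first named row with the same key
df1$V1[!known] <- df1$V1[known][match(key[!known], key[known])]

df1$V1
```

Rows whose values match no named row stay `NA`, which makes missing definitions easy to spot afterwards.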
Answer 3: (score: 0)
Here is another solution in addition to @akrun's, which uses match() from base R on columns V2 through V4, like so:
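# Look up each unnamed row's values (V2..V4) among the named rows.
# Assumes the named rows come first, so positions in the subset equal row numbers.
df$V1[which(is.na(df$V1))] <- df$V1[match(data.frame(t(subset(df, is.na(df$V1))[-1])),
                                          data.frame(t(subset(df, !is.na(df$V1))[-1])))]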