Original data:
# reproducible example
df1 <- read.table(text = "
Data1 Data2 Data3 Data4
gene6 - gene1 gene1
- gene2 - gene2
gene3 - gene3 gene5
gene2 gene4 gene2 gene4
gene4 gene1 gene5 -
gene1 gene3 gene6 gene3",
header = TRUE, stringsAsFactors = FALSE)
df1
# Data1 Data2 Data3 Data4
# 1 gene6 - gene1 gene1
# 2 - gene2 - gene2
# 3 gene3 - gene3 gene5
# 4 gene2 gene4 gene2 gene4
# 5 gene4 gene1 gene5 -
# 6 gene1 gene3 gene6 gene3
The expected output should look like this:
Data1 Data2 Data3 Data4
1 gene1 gene1 gene1 gene1
2 gene2 gene2 gene2 gene2
3 gene3 gene3 gene3 gene3
4 gene4 gene4 - gene4
5 - - gene5 gene5
6 gene6 - gene6 -
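Answer 0 (score: 4)
Next time you ask a question, please see the link @Sotos provided in the comments; as it stands, it is not clear what you are asking. Still, I liked the puzzle, so I gave it a try anyway. Here is one possible answer: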
# Sample data
df=read.table(text="Data1 Data2 Data3 Data4
gene6 - gene1 gene1
- gene2 - gene2
gene3 - gene3 gene5
gene2 gene4 gene2 gene4
gene4 gene1 gene5 -
gene1 gene3 gene6 gene3",header=T,stringsAsFactors=F)
# Helper function
completecolumn <- function(x, allgenes)
{
  # blank out every gene that does not occur in column x
  allgenes[!allgenes %in% x] = '-'
  return(allgenes)
}
# apply our function
allgenes=sort(setdiff(unique(unlist(df)), "-"))
df = do.call(cbind,lapply(df,completecolumn,allgenes))
Output:
Data1 Data2 Data3 Data4
[1,] "gene1" "gene1" "gene1" "gene1"
[2,] "gene2" "gene2" "gene2" "gene2"
[3,] "gene3" "gene3" "gene3" "gene3"
[4,] "gene4" "gene4" "-" "gene4"
[5,] "-" "-" "gene5" "gene5"
[6,] "gene6" "-" "gene6" "-"
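To make the helper's behaviour concrete, here is a minimal sketch that runs it on a single column. The names `all_genes`, `data1`, `filled`, and `sorted_names` are made up for this illustration. It also shows one caveat: `sort()` orders character vectors lexicographically, so with ten or more genes the rows would no longer follow numeric order.

```r
# Helper from the answer above, repeated so this snippet runs on its own
completecolumn <- function(x, allgenes) {
  allgenes[!allgenes %in% x] <- "-"
  allgenes
}

all_genes <- paste0("gene", 1:6)                               # full sorted gene list
data1 <- c("gene6", "-", "gene3", "gene2", "gene4", "gene1")   # the Data1 column

filled <- completecolumn(data1, all_genes)
filled
# [1] "gene1" "gene2" "gene3" "gene4" "-"     "gene6"

# Caveat: character sorting is lexicographic, not numeric
sorted_names <- sort(c("gene2", "gene10", "gene1"))
sorted_names
# [1] "gene1"  "gene10" "gene2"
```

If a data frame is needed rather than a character matrix, wrapping the `cbind` result in `as.data.frame()` should be enough.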
Hope this helps.
Answer 1 (score: 2)
Here is an option using mixedsort from the gtools package:
library(gtools)
i1 <- mixedsort(unique(df[df != '-']))
sapply(df, function(i) i[match(i1, i)])
which gives
     Data1   Data2   Data3   Data4
[1,] "gene1" "gene1" "gene1" "gene1"
[2,] "gene2" "gene2" "gene2" "gene2"
[3,] "gene3" "gene3" "gene3" "gene3"
[4,] "gene4" "gene4" NA      "gene4"
[5,] NA      NA      "gene5" "gene5"
[6,] "gene6" NA      "gene6" NA
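The result above has NA where the question asks for "-". A minimal sketch of one way to close that gap follows. It assumes every name has the form "gene<N>", and it swaps `gtools::mixedsort` for a base-R ordering on the numeric suffix so the snippet has no package dependency. `vals` and `res` are names made up for this illustration.

```r
df <- read.table(text = "Data1 Data2 Data3 Data4
gene6 - gene1 gene1
- gene2 - gene2
gene3 - gene3 gene5
gene2 gene4 gene2 gene4
gene4 gene1 gene5 -
gene1 gene3 gene6 gene3", header = TRUE, stringsAsFactors = FALSE)

# Unique gene names, ignoring the "-" placeholder
vals <- unique(df[df != "-"])

# Base-R stand-in for mixedsort (assumption: names look like "gene<N>")
i1 <- vals[order(as.integer(sub("^gene", "", vals)))]

# Align every column to the sorted gene list; absent genes come back as NA
res <- sapply(df, function(col) col[match(i1, col)])

# Put the "-" placeholder back in place of NA
res[is.na(res)] <- "-"
res
```

Because `res` is a character matrix, the NA replacement can be done with a single logical index over the whole matrix.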
Answer 2 (score: 2)
Not the output you asked for, but possibly more useful:
library(tidyr)
# wide-to-long format, then table with margins to see "common" gene counts
addmargins(table(gather(df1)))
# value
# key - gene1 gene2 gene3 gene4 gene5 gene6 Sum
# Data1 1 1 1 1 1 0 1 6
# Data2 2 1 1 1 1 0 0 6
# Data3 1 1 1 1 0 1 1 6
# Data4 1 1 1 1 1 1 0 6
# Sum 5 4 4 4 3 2 2 24
Answer 3 (score: 2)
An alternative solution using a bit of reshaping:
df=read.table(text="Data1 Data2 Data3 Data4
gene6 - gene1 gene1
- gene2 - gene2
gene3 - gene3 gene5
gene2 gene4 gene2 gene4
gene4 gene1 gene5 -
gene1 gene3 gene6 gene3",header=T,stringsAsFactors=F)
library(tidyverse)
df %>%
  gather() %>%
  filter(value != "-") %>%
  mutate(id = as.integer(substr(value, 5, nchar(value)))) %>%
  spread(key, value) %>%
  select(-id)
# Data1 Data2 Data3 Data4
# 1 gene1 gene1 gene1 gene1
# 2 gene2 gene2 gene2 gene2
# 3 gene3 gene3 gene3 gene3
# 4 gene4 gene4 <NA> gene4
# 5 <NA> <NA> gene5 gene5
# 6 gene6 <NA> gene6 <NA>
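In current tidyr, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. The sketch below is one way the same reshaping idea might be written with them. It is not the answerer's code, and it assumes dplyr and tidyr (1.0 or later) are installed and that names follow the "gene<N>" pattern. `values_fill = "-"` restores the placeholder instead of NA, and the final `select()` puts the columns back in their original order.

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "Data1 Data2 Data3 Data4
gene6 - gene1 gene1
- gene2 - gene2
gene3 - gene3 gene5
gene2 gene4 gene2 gene4
gene4 gene1 gene5 -
gene1 gene3 gene6 gene3", header = TRUE, stringsAsFactors = FALSE)

res <- df %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  filter(value != "-") %>%
  # numeric suffix serves as the row identifier (assumption: "gene<N>" names)
  mutate(id = as.integer(sub("^gene", "", value))) %>%
  arrange(id) %>%
  pivot_wider(names_from = key, values_from = value, values_fill = "-") %>%
  # drop id and restore the original column order
  select(all_of(names(df)))

res
```

Sorting by `id` before widening keeps the rows in gene order, since `pivot_wider()` emits rows in the order the identifiers first appear.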