Extract the common value pairs of multiple data frames to build a new binary data frame from them

Asked: 2018-10-09 15:11:22

Tags: r

I have 3 data frames:

Drug<-c("ab","bc","cd","ef","gh")
Target<-c("qwewr","saff","cxzcc","sadda","sadd")
fileA<-data.frame(Drug,Target)


Drug<-c("ab","bc","cdD","efc","ghg","hj")
Target<-c("qwewr","saff","cxzccf","saddav","sadd","bn")
fileB<-data.frame(Drug,Target)


Drug<-c("abB","bcv","cdD","efc")
Target<-c("qwewrm","saff","cxzccfh","saddav")
fileC<-data.frame(Drug,Target)

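As you can see, each object contains "Drug"-"Target" pairs. Each data frame contains only unique pairs, but the exact same pair can also appear in the other data frames. What I want to achieve is a new data frame whose first column holds all unique pairs and whose other 3 columns are named fileA, fileB and fileC, filled with 1 (if the pair exists in that file) and 0 (if it does not). Something like this:

    Pairs      fileA fileB fileC
 1: abqwewr        1     1     1
 2: bcsaff         1     1     1
 3: cdcxzcc        1     1     1
 4: efsadda        1     1     1
 5: ghsadd         1     1     0
 6: cdDcxzccf      0     0     0
 7: efcsaddav      0     0     0
 8: ghgsadd        0     0     0
 9: hjbn           0     0     0
10: abBqwewrm      0     0     0
11: bcvsaff        0     0     0
12: cdDcxzccfh     0     0     0

But the data frame here is not correct, because the first column contains only the drug names, and every row should have at least one 1.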

My approach:

[code block missing from the source]

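3 Answers:

Answer 0: (score: 5)

Form a list L of the three objects and use lapply to paste the columns of each one together. Then use stack to create a two-column data frame that holds the pasted values alongside an indicator of which object each value came from. Finally, use table to produce the counts.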

L <- mget(ls(pattern = "file"))
s <- stack(lapply(L, function(x) paste0(x[[1]], x[[2]])))
table(s)

giving:

            ind
values       fileA fileB fileC
  abBqwewrm      0     0     1
  abqwewr        1     1     0
  bcsaff         1     1     0
  bcvsaff        0     0     1
  cdcxzcc        1     0     0
  cdDcxzccf      0     1     0
  cdDcxzccfh     0     0     1
  efcsaddav      0     1     1
  efsadda        1     0     0
  ghgsadd        0     1     0
  ghsadd         1     0     0
  hjbn           0     1     0

A variation on this is to express it as the following pipeline:

library(magrittr)
mget(ls(pattern = "file")) %>%
  lapply(function(x) paste0(x[[1]], x[[2]])) %>%
  stack %>%
  table
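The question asks for a data frame, while table returns a table object, so a conversion step may be wanted afterwards. The following is a minimal sketch under that assumption, not part of the original answer: the names `tab` and `res` are illustrative, and the list is built explicitly rather than with mget so that the snippet does not depend on what else is in the workspace.

```r
# Example data from the question
fileA <- data.frame(Drug = c("ab", "bc", "cd", "ef", "gh"),
                    Target = c("qwewr", "saff", "cxzcc", "sadda", "sadd"),
                    stringsAsFactors = FALSE)
fileB <- data.frame(Drug = c("ab", "bc", "cdD", "efc", "ghg", "hj"),
                    Target = c("qwewr", "saff", "cxzccf", "saddav", "sadd", "bn"),
                    stringsAsFactors = FALSE)
fileC <- data.frame(Drug = c("abB", "bcv", "cdD", "efc"),
                    Target = c("qwewrm", "saff", "cxzccfh", "saddav"),
                    stringsAsFactors = FALSE)

# Named list, so that stack() can label each value with its source
L <- list(fileA = fileA, fileB = fileB, fileC = fileC)
s <- stack(lapply(L, function(x) paste0(x[[1]], x[[2]])))
tab <- table(s)

# Turn the 2-d table into a wide data frame, one column per source,
# and move the pair labels from the row names into a regular column
res <- as.data.frame.matrix(tab)
res <- data.frame(Pairs = rownames(res), res,
                  row.names = NULL, stringsAsFactors = FALSE)
res
```

Because each data frame contains only unique pairs, every count is 0 or 1, so the counts double as presence indicators.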

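Answer 1: (score: 2)

You can first create the pairs and then merge on them, while carrying along a column that records which file each pair came from:

Create an indicator column in each file: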

fileA$fileA <- 1
fileB$fileB <- 1
fileC$fileC <- 1

Create the pair column in each file:

fileA$DrugTargetPair <- paste0(fileA$Drug, fileA$Target)
fileB$DrugTargetPair <- paste0(fileB$Drug, fileB$Target)
fileC$DrugTargetPair <- paste0(fileC$Drug, fileC$Target)
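One caveat that the answers do not raise: pasting without a separator can, in principle, make two different pairs look identical. The example data has no such collision, so this is only a precaution. The sketch below uses made-up values (`drug`, `target`) purely for illustration, and the choice of "|" as separator is an assumption that it never occurs in the real names.

```r
# Two different (Drug, Target) pairs, hypothetical values
drug   <- c("ab", "a")
target <- c("c", "bc")

no_sep   <- paste0(drug, target)            # both become "abc"
with_sep <- paste(drug, target, sep = "|")  # "ab|c" and "a|bc" stay distinct

length(unique(no_sep))    # 1
length(unique(with_sep))  # 2
```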

Keep only the indicator column and the pair column:

fileA <- fileA[, c("DrugTargetPair", "fileA")]
fileB <- fileB[, c("DrugTargetPair", "fileB")]
fileC <- fileC[, c("DrugTargetPair", "fileC")]

Merge on the pair column, using all = TRUE to keep all pairs:

file_new <- merge(fileA, fileB, by = "DrugTargetPair", all = TRUE)
file_new <- merge(file_new, fileC, by = "DrugTargetPair", all = TRUE)
file_new[is.na(file_new)] <- 0

file_new
   DrugTargetPair fileA fileB fileC
1       abBqwewrm     0     0     1
2         abqwewr     1     1     0
3          bcsaff     1     1     0
4         bcvsaff     0     0     1
5         cdcxzcc     1     0     0
6       cdDcxzccf     0     1     0
7      cdDcxzccfh     0     0     1
8       efcsaddav     0     1     1
9         efsadda     1     0     0
10        ghgsadd     0     1     0
11         ghsadd     1     0     0
12           hjbn     0     1     0
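The same steps can be written once for any number of files by keeping the data frames in a named list and folding merge over it. This is a sketch of that generalisation rather than the answer's own code; `files`, `tagged` and `merged` are illustrative names.

```r
# Example data from the question, held in a named list
files <- list(
  fileA = data.frame(Drug = c("ab", "bc", "cd", "ef", "gh"),
                     Target = c("qwewr", "saff", "cxzcc", "sadda", "sadd"),
                     stringsAsFactors = FALSE),
  fileB = data.frame(Drug = c("ab", "bc", "cdD", "efc", "ghg", "hj"),
                     Target = c("qwewr", "saff", "cxzccf", "saddav", "sadd", "bn"),
                     stringsAsFactors = FALSE),
  fileC = data.frame(Drug = c("abB", "bcv", "cdD", "efc"),
                     Target = c("qwewrm", "saff", "cxzccfh", "saddav"),
                     stringsAsFactors = FALSE)
)

# For each file: build the pair column and an indicator column named after the file
tagged <- Map(function(df, nm) {
  out <- data.frame(DrugTargetPair = paste0(df$Drug, df$Target),
                    stringsAsFactors = FALSE)
  out[[nm]] <- 1
  out
}, files, names(files))

# Full outer join across all files, then turn missing indicators into 0
merged <- Reduce(function(x, y) merge(x, y, by = "DrugTargetPair", all = TRUE),
                 tagged)
merged[is.na(merged)] <- 0
merged
```

Adding a fourth file then only requires adding one entry to the list.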

Answer 2: (score: 1)

Data:

Drug<-c("ab","bc","cd","ef","gh")
Target<-c("qwewr","saff","cxzcc","sadda","sadd")
fileA<-data.frame(I(Drug),I(Target))


Drug<-c("ab","bc","cdD","efc","ghg","hj")
Target<-c("qwewr","saff","cxzccf","saddav","sadd","bn")
fileB<-data.frame(I(Drug),I(Target))


Drug<-c("abB","bcv","cdD","efc")
Target<-c("qwewrm","saff","cxzccfh","saddav")
fileC<-data.frame(I(Drug),I(Target))

Code:

all_list <- list(fileA, fileB, fileC)

all1 <- rbind(fileA,fileB,fileC)
all1 <- as.data.frame(unique(all1))

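# For every unique pair, flag (1/0) whether it occurs as a row of each data frame
ans <- t(apply(all1, 1, function(drgT) {
  sapply(all_list, function(x) {
    (list(drgT) %in% unlist(apply(x, 1, list), recursive = FALSE)) * 1
  })
}))
# Zero out pairs found in only one data frame, so that only shared pairs keep their 1s
ans[rowSums(ans) == 1, ] <- 0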

cbind(all1, ans)

Result:

#   Drug  Target 1 2 3
#1    ab   qwewr 1 1 0
#2    bc    saff 1 1 0
#3    cd   cxzcc 0 0 0
#4    ef   sadda 0 0 0
#5    gh    sadd 0 0 0
#8   cdD  cxzccf 0 0 0
#9   efc  saddav 0 1 1
#10  ghg    sadd 0 0 0
#11   hj      bn 0 0 0
#12  abB  qwewrm 0 0 0
#13  bcv    saff 0 0 0
#14  cdD cxzccfh 0 0 0
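The row-by-row membership test can also be expressed with %in% on a pasted key, which may be easier to read. The following is a sketch under assumptions, not the answer's code: the names `keys`, `all_keys`, `ind`, `res` and `shared` are illustrative, and "|" is assumed never to occur in the drug or target names.

```r
# Example data from the question, held in a named list
files <- list(
  fileA = data.frame(Drug = c("ab", "bc", "cd", "ef", "gh"),
                     Target = c("qwewr", "saff", "cxzcc", "sadda", "sadd"),
                     stringsAsFactors = FALSE),
  fileB = data.frame(Drug = c("ab", "bc", "cdD", "efc", "ghg", "hj"),
                     Target = c("qwewr", "saff", "cxzccf", "saddav", "sadd", "bn"),
                     stringsAsFactors = FALSE),
  fileC = data.frame(Drug = c("abB", "bcv", "cdD", "efc"),
                     Target = c("qwewrm", "saff", "cxzccfh", "saddav"),
                     stringsAsFactors = FALSE)
)

# One key per row; the separator keeps Drug and Target apart
keys <- lapply(files, function(df) paste(df$Drug, df$Target, sep = "|"))
all_keys <- unique(unlist(keys, use.names = FALSE))

# One 0/1 column per file: is each unique key present in that file?
ind <- sapply(keys, function(k) as.integer(all_keys %in% k))
res <- data.frame(Pairs = all_keys, ind, stringsAsFactors = FALSE)

# Optional, mirroring the answer: keep 1s only for pairs shared by 2+ files
shared <- res
shared[rowSums(ind) == 1, names(files)] <- 0
shared
```

Here `res` marks every occurrence, while `shared` reproduces the answer's choice of zeroing out pairs that appear in a single file.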

Please note:

  • Please revise the example data / desired result.
  • E D U C A T E yourself on using stringsAsFactors.
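For context on the second point: before R 4.0.0, data.frame converted character vectors to factors by default, and from R 4.0.0 onwards the default is stringsAsFactors = FALSE. A small sketch of the difference, with illustrative variable names:

```r
drug <- c("ab", "bc", "cd")

# Explicitly request each behaviour so the result does not depend on the R version
as_factor <- data.frame(Drug = drug, stringsAsFactors = TRUE)
as_string <- data.frame(Drug = drug, stringsAsFactors = FALSE)

class(as_factor$Drug)  # "factor"
class(as_string$Drug)  # "character"
```

Wrapping the columns in I(), as the data block of this answer does, is another way to stop data.frame from converting them.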