How to get the majority value across multiple data frames

Time: 2016-10-26 17:56:41

Tags: r

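I have three data frames (df1, df2, df3) generated by different methods. They have the same structure, but their values may differ. Each cell holds one of four values: "A", "B", "H", or "-". I want to build a consensus table from the three data frames by taking the majority value in each cell, and giving "-" when there is no majority. I would really appreciate any help.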

df1 = read.table(text="ID   S01 S02 S03 S04 S05
M01 A   H   A   B   B
M02 A   H   A   B   A
M03 A   A   H   B   A
M04 B   A   H   B   H
M05 B   A   H   B   A
M06 B   B   H   B   A
M07 H   B   B   H   B
M08 H   B   B   H   A
M09 H   B   B   H   A
M10 H   B   B   H   A", header=T, stringsAsFactors=F)

df2 = read.table(text="ID   S01 S02 S03 S04 S05
M01 A   H   A   B   A
M02 A   H   A   B   A
M03 H   A   H   B   A
M04 H   A   H   B   A
M05 B   A   H   B   A
M06 B   A   B   B   A
M07 -   B   B   -   B
M08 H   B   B   H   A
M09 H   B   B   H   A
M10 H   B   B   H   A", header=T, stringsAsFactors=F)

df3 = read.table(text="ID   S01 S02 S03 S04 S05
M01 B   H   A   B   A
M02 A   H   A   B   A
M03 B   A   H   B   A
M04 B   A   H   B   B
M05 B   A   H   B   A
M06 B   A   H   B   A
M07 A   B   B   H   H
M08 H   B   B   H   A
M09 H   B   B   H   A
M10 H   B   B   H   A", header=T, stringsAsFactors=F)

Expected result:

df = read.table(text="ID    S01 S02 S03 S04 S05
M01 A   H   A   B   A
M02 A   H   A   B   A
M03 -   A   H   B   A
M04 B   A   H   B   -
M05 B   A   H   B   A
M06 B   A   H   B   A
M07 -   B   B   H   B
M08 H   B   B   H   A
M09 H   B   B   H   A
M10 H   B   B   H   A", header=T, stringsAsFactors=F)

3 answers:

Answer 0 (score: 2)

We put the datasets in a list, rbind them, then group by "ID", loop over the columns, and get the Mode of the elements in each group:

library(data.table)
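Mode <- function(x) {
  # If every value is different, no value repeats, so return "-"
  if (uniqueN(x) == length(x)) {
    "-"
  } else {
    # Otherwise return the most frequent value
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
  }
}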

rbindlist(mget(paste0("df", 1:3)))[, lapply(.SD, Mode) , by = ID]
#      ID S01 S02 S03 S04 S05
# 1: M01   A   H   A   B   A
# 2: M02   A   H   A   B   A
# 3: M03   -   A   H   B   A
# 4: M04   B   A   H   B   -
# 5: M05   B   A   H   B   A
# 6: M06   B   A   H   B   A
# 7: M07   -   B   B   H   B
# 8: M08   H   B   B   H   A
# 9: M09   H   B   B   H   A
#10: M10   H   B   B   H   A
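
Note that the Mode function above returns "-" only when all values in a group are distinct. With exactly three tables that is the same as "no majority", but with four or more tables a value could repeat without being a majority. A minimal base R sketch of a strict majority rule follows, assuming there are no NA values; the helper `majority` and the toy tables `t1`, `t2`, `t3` are hypothetical names for illustration, not part of the original answer.

```r
# Strict majority: return the value found in more than half of the entries,
# otherwise the fallback "-". (Hypothetical helper, assumes no NA values.)
majority <- function(x, fallback = "-") {
  counts <- table(x)
  winner <- names(counts)[which.max(counts)]
  if (max(counts) > length(x) / 2) winner else fallback
}

# Toy inputs (hypothetical, much smaller than the tables in the question)
t1 <- data.frame(ID = c("M01", "M02"), S01 = c("A", "H"), S02 = c("B", "A"),
                 stringsAsFactors = FALSE)
t2 <- data.frame(ID = c("M01", "M02"), S01 = c("A", "-"), S02 = c("B", "H"),
                 stringsAsFactors = FALSE)
t3 <- data.frame(ID = c("M01", "M02"), S01 = c("B", "A"), S02 = c("B", "B"),
                 stringsAsFactors = FALSE)

# Stack the tables, then apply the majority rule to every column within each ID
stacked <- do.call(rbind, list(t1, t2, t3))
consensus <- aggregate(stacked[-1], by = list(ID = stacked$ID), FUN = majority)
consensus
```

The same `majority` function could be passed to `lapply(.SD, ...)` in the data.table call above in place of `Mode`.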

Answer 1 (score: 2)

Similar to @akrun's answer, but I combine the tables and find the mode of each cell slightly differently:

Combine the tables (into "data.master"):

df1$df <- 1
df2$df <- 2
df3$df <- 3

data.master <- do.call(rbind, list(df1, df2, df3))

Compute the mode:

library(dplyr)

data.mode <- data.master %>% 
    group_by(ID) %>% 
    summarize_all(function(x) ifelse(sort(table(x), decreasing = T)[1] > 1, names(sort(table(x), decreasing = T))[1], '-')) %>% 
    select(-df)

      ID   S01   S02   S03   S04   S05
   <chr> <chr> <chr> <chr> <chr> <chr>
1    M01     A     H     A     B     A
2    M02     A     H     A     B     A
3    M03     -     A     H     B     A
4    M04     B     A     H     B     -
5    M05     B     A     H     B     A
6    M06     B     A     H     B     A
7    M07     -     B     B     H     B
8    M08     H     B     B     H     A
9    M09     H     B     B     H     A
10   M10     H     B     B     H     A

Answer 2 (score: 1)

Base R solution:

options(stringsAsFactors = FALSE)
moda = function(x){
    # here we rely on the fact that we have only three data frames:
    # any duplicated value is then automatically the majority value
    dupl = anyDuplicated(x)
    if(dupl){
        x[dupl]
    } else {
        "-"
    }
}

aggregate(. ~ ID, 
          data = rbind(df1, df2, df3), 
          FUN = moda
          )

#     ID S01 S02 S03 S04 S05
# 1  M01   A   H   A   B   A
# 2  M02   A   H   A   B   A
# 3  M03   -   A   H   B   A
# 4  M04   B   A   H   B   -
# 5  M05   B   A   H   B   A
# 6  M06   B   A   H   B   A
# 7  M07   -   B   B   H   B
# 8  M08   H   B   B   H   A
# 9  M09   H   B   B   H   A
# 10 M10   H   B   B   H   A
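
For exactly three tables, the consensus can also be computed without stacking, by comparing the tables cell by cell. The sketch below assumes that all three tables list the same IDs and columns in the same order; the toy tables `t1`, `t2`, `t3` are hypothetical stand-ins for df1, df2, df3.

```r
# Toy inputs (hypothetical); rows and columns are assumed to be aligned
t1 <- data.frame(ID = c("M01", "M02"), S01 = c("A", "H"), S02 = c("B", "A"),
                 stringsAsFactors = FALSE)
t2 <- data.frame(ID = c("M01", "M02"), S01 = c("A", "-"), S02 = c("B", "H"),
                 stringsAsFactors = FALSE)
t3 <- data.frame(ID = c("M01", "M02"), S01 = c("B", "A"), S02 = c("B", "B"),
                 stringsAsFactors = FALSE)

# Drop the ID column and convert each table to a character matrix
m <- lapply(list(t1, t2, t3), function(d) as.matrix(d[-1]))

# A cell has a majority if table 1 agrees with table 2 or 3 (take table 1),
# or if tables 2 and 3 agree with each other (take table 2); otherwise "-"
agree <- ifelse(m[[1]] == m[[2]] | m[[1]] == m[[3]], m[[1]],
                ifelse(m[[2]] == m[[3]], m[[2]], "-"))

consensus <- data.frame(ID = t1$ID, agree, stringsAsFactors = FALSE)
consensus
```

This avoids the grouping step entirely, at the cost of depending on row order, so it may be worth sorting each table by ID first if the order is not guaranteed.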