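I have three data frames (df1, df2, df3) generated by different methods. They share the same structure, but their values may differ. Each cell holds one of four values: "A", "B", "H" or "-". I would like to build a consensus table from the three data frames by taking the majority value for each cell, and giving "-" when there is no majority. Any help would be much appreciated.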
df1 = read.table(text="ID S01 S02 S03 S04 S05
M01 A H A B B
M02 A H A B A
M03 A A H B A
M04 B A H B H
M05 B A H B A
M06 B B H B A
M07 H B B H B
M08 H B B H A
M09 H B B H A
M10 H B B H A", header=T, stringsAsFactors=F)
df2 = read.table(text="ID S01 S02 S03 S04 S05
M01 A H A B A
M02 A H A B A
M03 H A H B A
M04 H A H B A
M05 B A H B A
M06 B A B B A
M07 - B B - B
M08 H B B H A
M09 H B B H A
M10 H B B H A", header=T, stringsAsFactors=F)
df3 = read.table(text="ID S01 S02 S03 S04 S05
M01 B H A B A
M02 A H A B A
M03 B A H B A
M04 B A H B B
M05 B A H B A
M06 B A H B A
M07 A B B H H
M08 H B B H A
M09 H B B H A
M10 H B B H A", header=T, stringsAsFactors=F)
Expected result:
df = read.table(text="ID S01 S02 S03 S04 S05
M01 A H A B A
M02 A H A B A
M03 - A H B A
M04 B A H B -
M05 B A H B A
M06 B A H B A
M07 - B B H B
M08 H B B H A
M09 H B B H A
M10 H B B H A", header=T, stringsAsFactors=F)
Answer 0 (score: 2)
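We keep the datasets in a list, rbind them (with rbindlist), then group by "ID", loop over the columns, and take the Mode of the elements in each group.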
library(data.table)
# Most frequent value of x, or "-" when all values are different
Mode <- function(x) {
  if (uniqueN(x) == length(x)) {
    "-"
  } else {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
  }
}
rbindlist(mget(paste0("df", 1:3)))[, lapply(.SD, Mode) , by = ID]
# ID S01 S02 S03 S04 S05
# 1: M01 A H A B A
# 2: M02 A H A B A
# 3: M03 - A H B A
# 4: M04 B A H B -
# 5: M05 B A H B A
# 6: M06 B A H B A
# 7: M07 - B B H B
# 8: M08 H B B H A
# 9: M09 H B B H A
#10: M10 H B B H A
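To see how the tie rule behaves on individual cells, here is a small sketch that restates Mode in base R. The name mode_cell is made up for illustration, and length(unique(x)) stands in for data.table's uniqueN so the snippet runs without extra packages. The three calls use values taken from the question's data.

```r
# Base R restatement of Mode(); mode_cell is a hypothetical name for illustration.
mode_cell <- function(x) {
  if (length(unique(x)) == length(x)) {
    "-"                                    # every value differs: no consensus
  } else {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]  # most frequent value
  }
}

mode_cell(c("A", "H", "B"))  # M03/S01 in df1, df2, df3: returns "-"
mode_cell(c("B", "A", "A"))  # M01/S05: returns "A"
mode_cell(c("H", "-", "H"))  # M07/S04: returns "H"
```

Note that the "all values differ" check only coincides with "no majority" because there are exactly three inputs. With four inputs, a 2-2 tie such as A, A, B, B would return "A".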
Answer 1 (score: 2)
Similar to @akrun's answer, but I combine the tables and find the mode of each cell slightly differently:
Combine the tables (into "data.master"):
df1$df <- 1
df2$df <- 2
df3$df <- 3
data.master <- do.call(rbind, list(df1, df2, df3))
Compute the mode:
library(dplyr)
data.mode <- data.master %>%
  group_by(ID) %>%
  # count the values per ID; keep the most frequent one only if it occurs more than once
  summarize_all(function(x) {
    counts <- sort(table(x), decreasing = TRUE)
    if (counts[1] > 1) names(counts)[1] else '-'
  }) %>%
  select(-df)
ID S01 S02 S03 S04 S05
<chr> <chr> <chr> <chr> <chr> <chr>
1 M01 A H A B A
2 M02 A H A B A
3 M03 - A H B A
4 M04 B A H B -
5 M05 B A H B A
6 M06 B A H B A
7 M07 - B B H B
8 M08 H B B H A
9 M09 H B B H A
10 M10 H B B H A
Answer 2 (score: 1)
Base R solution:
options(stringsAsFactors = FALSE)
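moda = function(x){
  # here we rely on the fact that we have only three data.frames:
  # any repeated value is then automatically the majority
  dupl = anyDuplicated(x)  # index of the first repeated value, 0 if none
  if(dupl){
    x[dupl]
  } else {
    "-"
  }
}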
aggregate(. ~ ID,
data = rbind(df1, df2, df3),
FUN = moda
)
# ID S01 S02 S03 S04 S05
# 1 M01 A H A B A
# 2 M02 A H A B A
# 3 M03 - A H B A
# 4 M04 B A H B -
# 5 M05 B A H B A
# 6 M06 B A H B A
# 7 M07 - B B H B
# 8 M08 H B B H A
# 9 M09 H B B H A
# 10 M10 H B B H A
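As the comment in moda says, the duplicate check only works for three data frames. If more inputs are ever combined, a strict-majority count may be safer. The following is a minimal sketch, not part of the original answer. The function name majority and the toy frames d1, d2, d3 are made up for illustration. It uses the by-argument form of aggregate rather than the formula form.

```r
# Hypothetical helper: strict-majority vote that works for any number of inputs.
majority <- function(x) {
  counts <- table(x)
  if (max(counts) > length(x) / 2) {
    names(counts)[which.max(counts)]  # value held by more than half of the inputs
  } else {
    "-"                               # no value reaches a strict majority
  }
}

# Toy inputs standing in for df1, df2, df3 (two markers, one sample column).
d1 <- data.frame(ID = c("M01", "M02"), S01 = c("A", "H"), stringsAsFactors = FALSE)
d2 <- data.frame(ID = c("M01", "M02"), S01 = c("A", "B"), stringsAsFactors = FALSE)
d3 <- data.frame(ID = c("M01", "M02"), S01 = c("B", "A"), stringsAsFactors = FALSE)

stacked <- rbind(d1, d2, d3)
consensus <- aggregate(stacked[-1], by = list(ID = stacked$ID), FUN = majority)
consensus
```

With three inputs this should give the same answers as moda. The difference shows up with an even number of inputs, where a 2-2 split yields "-" instead of whichever value happens to repeat first.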