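I have two data frames, A and B, that have the same columns apart from the primary key (in the real data I have more than 50 such columns). I want to compare the summary statistics (the usual output of R's summary() command) of the two data frames, and for the comparison I would like to see them next to each other, as in the attached image.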
DATAFRAME DPUT STRUCTURE
structure(list(Pkey = c(1, 2, 3, 4, 5), Phy_marks = c(43, 44, 45,
46, 47), Math_marks = c(34, 34, 45, 32, 21)), .Names = c("Pkey",
"Phy_marks", "Math_marks"), row.names = c(NA, -5L), class =
"data.frame")
structure(list(Pkey = c(11, 12, 13, 14, 15), Phy_marks = c(43, 44,
45, 46, 47), Math_marks = c(34, 34, 45, 32, 21)), .Names = c("Pkey",
"Phy_marks", "Math_marks"), row.names = c(NA, -5L), class =
"data.frame")
Please help!
Answer 0 (score: 3)
You can use the function I wrote below to compare the two data sets.
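library(dplyr)

# Summarise every column of both data frames and put the results side by side.
compare_them <- function(data1, data2) {
  # apply() coerces to a matrix, so this assumes all columns are numeric
  sum1 <- apply(data1, 2, summary) %>% data.frame()
  sum2 <- apply(data2, 2, summary) %>% data.frame()
  # suffix the column names so the source data set stays identifiable
  names(sum1) <- paste0(names(sum1), "1")
  names(sum2) <- paste0(names(sum2), "2")
  final <- cbind(sum1, sum2)
  # transpose, sort by name so matching columns become adjacent, transpose back
  final1 <- t(final)
  final2 <- final1[order(row.names(final1)), ]
  final_1 <- t(final2) %>% data.frame()
  final_1
}

# View() opens the result in the data viewer (interactive sessions only)
compare_them(mtcars, mtcars * 2) %>% View()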
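Variables from data1 get a "1" appended to their names, and variables from data2 get a "2". I used mtcars and mtcars * 2 as an example. In the final result, matching columns sit next to each other.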
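As a variation on the function above, here is a minimal base R sketch applied to the data frames from the question. The helper name compare_summaries, its key argument, and the "_A"/"_B" suffixes are illustrative assumptions, not part of the answer. It drops the key column before summarising and interleaves the columns by position rather than by sorting names.

```r
# Sample data from the question
A <- data.frame(Pkey = c(1, 2, 3, 4, 5),
                Phy_marks = c(43, 44, 45, 46, 47),
                Math_marks = c(34, 34, 45, 32, 21))
B <- data.frame(Pkey = c(11, 12, 13, 14, 15),
                Phy_marks = c(43, 44, 45, 46, 47),
                Math_marks = c(34, 34, 45, 32, 21))

# Hypothetical helper: summary() of every shared column, side by side.
# Assumes all shared non-key columns are numeric without missing values.
compare_summaries <- function(a, b, key = "Pkey") {
  cols <- setdiff(intersect(names(a), names(b)), key)
  sum_a <- sapply(a[cols], summary)   # 6 x length(cols) matrix
  sum_b <- sapply(b[cols], summary)
  colnames(sum_a) <- paste0(cols, "_A")
  colnames(sum_b) <- paste0(cols, "_B")
  both <- cbind(sum_a, sum_b)
  # interleave: col1_A, col1_B, col2_A, col2_B, ...
  both[, order(rep(seq_along(cols), times = 2)), drop = FALSE]
}

res <- compare_summaries(A, B)
print(res)
```

Interleaving by position keeps the original column order and avoids the case where sorting suffixed names separates a pair (for example, columns named x and x1).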
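Answer 1 (score: 0)
One option is to use summarise_all, dcast, unite and separate to compute the desired statistics for each data frame and arrange them side by side.
Note: the sample data provided by the OP has been slightly modified so that the statistics of df_b differ from those of df_a.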
library(tidyverse)
library(reshape2)

# Note: funs() was deprecated in dplyr 0.8.0; this code targets the older API.
df_a %>% mutate(Grp = "A") %>%
  bind_rows(mutate(df_b, Grp = "B")) %>%
  select(-Pkey) %>%
  group_by(Grp) %>% {
    # one summarise_all() per statistic set, joined back together by group
    inner_join(inner_join(inner_join(summarise_all(., funs(min, mean, median, max)),
                                     summarise_all(., funs(Q1 = quantile), probs = 0.25), by = "Grp"),
                          summarise_all(., funs(Q2 = quantile), probs = 0.50), by = "Grp"),
               summarise_all(., funs(Q3 = quantile), probs = 0.75), by = "Grp"
    )
  } %>% as.data.frame() %>%
  # long format, then split "<column>_<statistic>" and attach the group label
  gather(key, val, -Grp) %>%
  separate("key", c("sub", "param"), sep = "_") %>%
  unite("sub", c("sub", "Grp"), sep = "_") %>%
  # one row per statistic, one column per variable/group pair, sorted by name
  dcast(param ~ sub, value.var = "val") %>%
  select_at(vars(param, sort(names(select(., -param)))))
#   param Math.marks_A Math.marks_B Phy.marks_A Phy.marks_B
#1    max         45.0        100.0          47        99.0
#2   mean         33.2         66.4          45        63.6
#3 median         34.0         80.0          45        60.0
#4    min         21.0         24.0          43        25.0
#5     Q1         32.0         40.0          44        44.0
#6     Q2         34.0         80.0          45        60.0
#7     Q3         34.0         88.0          46        90.0
Data
df_a <- structure(list(Pkey = c(1, 2, 3, 4, 5),
Phy.marks = c(43, 44, 45, 46, 47), Math.marks = c(34, 34, 45, 32, 21)),
.Names = c("Pkey", "Phy.marks", "Math.marks"),
row.names = c(NA, -5L), class = "data.frame")
df_b <- structure(list(Pkey = c(11, 12, 13, 14, 15),
Phy.marks = c(90, 44, 60, 25, 99),
Math.marks = c(24, 40, 80, 88, 100)),
.Names = c("Pkey", "Phy.marks", "Math.marks"),
row.names = c(NA, -5L), class = "data.frame")
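Because funs() has been deprecated since dplyr 0.8.0, the same idea can be sketched with across() and tidyr's pivot functions. This is a sketch that assumes dplyr 1.0 or later and tidyr 1.0 or later; the names stats and res and the "__" separator are illustrative choices, not part of the answer. Q2 is omitted because it equals the median.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the answer
df_a <- data.frame(Pkey = c(1, 2, 3, 4, 5),
                   Phy.marks = c(43, 44, 45, 46, 47),
                   Math.marks = c(34, 34, 45, 32, 21))
df_b <- data.frame(Pkey = c(11, 12, 13, 14, 15),
                   Phy.marks = c(90, 44, 60, 25, 99),
                   Math.marks = c(24, 40, 80, 88, 100))

# Named list of statistics (illustrative; extend as needed)
stats <- list(
  min = min, mean = mean, median = median, max = max,
  Q1 = function(x) quantile(x, 0.25, names = FALSE),
  Q3 = function(x) quantile(x, 0.75, names = FALSE)
)

res <- bind_rows(A = df_a, B = df_b, .id = "Grp") %>%
  select(-Pkey) %>%
  group_by(Grp) %>%
  # one column per variable/statistic pair, e.g. "Phy.marks__min"
  summarise(across(everything(), stats, .names = "{.col}__{.fn}")) %>%
  # long format: one row per group, variable and statistic
  pivot_longer(-Grp, names_to = c("sub", "param"), names_sep = "__") %>%
  # wide again: one row per statistic, one column per variable/group pair
  pivot_wider(names_from = c(sub, Grp), values_from = value)

# Sort the columns by name so each variable's A and B columns are adjacent
res <- res[, c("param", sort(setdiff(names(res), "param")))]
print(res)
```

Using "__" as the separator instead of "_" means column names that contain a single underscore, such as Phy_marks in the question, do not need to be renamed first.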