I have two tables in the format shown below. The first is table A:
students | Test Score
A | 100
B | 81
C | 92
D | 88
The other table, B, looks like this:
Class | Students
1 | {A,D}
2 | {B,C}
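I would like to perform an operation in R that looks up, in table A, the students listed in the array in table B's Students column, and reports their mean score in the following format: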
Class | Students | Mean Score
1 | {A,D} | 94
2 | {B,C} | 86.5
Is there a function I can use to do this lookup and then combine the results in R?
Answer 0 (score: 4)
A simple approach using base R:
df2$mean_score <- sapply(df2$Students, function(x, df) {
students_vec <- unlist( strsplit(gsub("[{}]","", x), split=",") )
mean(df[which( df$students %in% students_vec ), "Test Score"] )
}, df = df1)
df2
# Class Students mean_score
#1 1 {A,D} 94.0
#2 2 {B,C} 86.5
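We sapply over the Students column of df2 and, for each element, build a vector of the students we want. We then subset df1 to just those students and take the mean of their scores. Note that this assumes your df2$Students data are character strings.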
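To make the steps inside the sapply call concrete, here is a minimal sketch that walks through a single element by hand. The small df_scores table and its score column are stand-ins for df1 (an assumption, to keep the block self-contained), not part of the original answer.

```r
# Hypothetical stand-in for df1, with the same values as the question
df_scores <- data.frame(students = c("A", "B", "C", "D"),
                        score = c(100, 81, 92, 88),
                        stringsAsFactors = FALSE)

x <- "{A,D}"                                # one element of df2$Students

# Step 1: strip the curly braces
no_braces <- gsub("[{}]", "", x)            # "A,D"

# Step 2: split on the comma to get a character vector of student names
students_vec <- unlist(strsplit(no_braces, split = ","))

# Step 3: keep only the rows for those students and average their scores
keep <- df_scores$students %in% students_vec
class_mean <- mean(df_scores$score[keep])   # (100 + 88) / 2 = 94
```

If the Students column were a factor instead, wrapping it in as.character() before gsub would be one way to satisfy the string assumption.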
Data:
df1 <- structure(list(students = c("A", "B", "C", "D"), `Test Score` = c(100L,
81L, 92L, 88L)), .Names = c("students", "Test Score"), row.names = c(NA,
-4L), class = "data.frame")
df2 <- structure(list(Class = 1:2, Students = c("{A,D}", "{B,C}")), .Names = c("Class",
"Students"), row.names = c(NA, -2L), class = "data.frame")
Answer 1 (score: 4)
A solution similar to @MikeH's:
B$MeanScore <- sapply(strsplit(gsub("[{}]","", B$Students), split=","),
function(x) mean(A$Test.Score[A$Students %in% x]))
which gives:
# Class Students MeanScore
#1 1 {A,D} 94.0
#2 2 {B,C} 86.5
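Answer 2 (score: 2)
A dplyr and tidyr solution that uses unnest to split the students and paste with the collapse option to put them back together when summarising. Test data from @Ben Fasoli: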
# Load the packages first: %>% and mutate() below come from dplyr
library(dplyr)
library(tidyr)
A <- read.csv(text = 'Students,Test Score
A, 100
B, 81
C, 92
D, 88', stringsAsFactors = F)
B <- read.csv(text = 'Class, Students
1,"{A,D}"
2,"{B,C}"', stringsAsFactors = F) %>%
mutate(Students = gsub('\\{|\\}', '', Students))
B %>%
unnest(Students = strsplit(Students, ",")) %>%
inner_join(A) %>%
group_by(Class) %>%
summarize(Students = paste0("{", paste(Students, collapse=","), "}"), mean_score = mean(Test.Score))
# Class Students mean_score
# <int> <chr> <dbl>
# 1 1 {A,D} 94.0
# 2 2 {B,C} 86.5
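The unnest(Students = ...) call above uses what I believe is the older tidyr interface. A rough equivalent, assuming tidyr 1.0.0 or later, builds the list column with mutate first and then unnests it by name. The data frames are rebuilt inline with data.frame purely for illustration, and the result name res is hypothetical.

```r
library(dplyr)
library(tidyr)

# Same values as the test data above, with the braces already stripped
A <- data.frame(Students = c("A", "B", "C", "D"),
                Test.Score = c(100, 81, 92, 88),
                stringsAsFactors = FALSE)
B <- data.frame(Class = 1:2,
                Students = c("A,D", "B,C"),
                stringsAsFactors = FALSE)

res <- B %>%
  mutate(Students = strsplit(Students, ",")) %>%  # list column: one vector per class
  unnest(Students) %>%                            # one row per (Class, student)
  inner_join(A, by = "Students") %>%              # attach each student's score
  group_by(Class) %>%
  summarize(Students = paste0("{", paste(Students, collapse = ","), "}"),
            mean_score = mean(Test.Score))
```

Passing by = "Students" to inner_join is a small addition that avoids the message dplyr prints when it has to guess the join column.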
Answer 3 (score: 0)
There may be more creative ways to do this, but here is a solution using dplyr.
library(dplyr)
lapply(B$Class, function(x) {
mask <- B$Class == x
data.frame(Class = x,
Students = unlist(strsplit(B$Students[mask], ',')),
stringsAsFactors = F)
}) %>%
bind_rows() %>%
full_join(A, by = 'Students') %>%
group_by(Class) %>%
summarize(`Mean Score` = mean(Test.Score)) %>%
full_join(B, by = 'Class')
The dplyr package helps with data manipulation. Here is a reproducible example.
library(dplyr)
A <- read.csv(text = 'Students,Test Score
A, 100
B, 81
C, 92
D, 88', stringsAsFactors = F)
B <- read.csv(text = 'Class, Students
1,"{A,D}"
2,"{B,C}"', stringsAsFactors = F) %>%
mutate(Students = gsub('\\{|\\}', '', Students))
str(A)
# 'data.frame': 4 obs. of 2 variables:
# $ Students : chr "A" "B" "C" "D"
# $ Test.Score: int 100 81 92 88
str(B)
# 'data.frame': 2 obs. of 2 variables:
# $ Class : int 1 2
# $ Students: chr "A,D" "B,C"
Do some character manipulation to convert table B to "long" format.
C <- lapply(B$Class, function(x) {
mask <- B$Class == x
data.frame(Class = x,
Students = unlist(strsplit(B$Students[mask], ',')),
stringsAsFactors = F)
}) %>%
bind_rows()
str(C)
# 'data.frame': 4 obs. of 2 variables:
# $ Class : int 1 1 2 2
# $ Students: chr "A" "D" "B" "C"
Add the students' scores to our "long" table.
D <- full_join(A, C, by = 'Students')
str(D)
# 'data.frame': 4 obs. of 3 variables:
# $ Students : chr "A" "B" "C" "D"
# $ Test.Score: int 100 81 92 88
# $ Class : int 1 2 2 1
Group the students by class and compute the mean for each class. Then add a column listing which students are in each class.
E <- D %>%
group_by(Class) %>%
summarize(`Mean Score` = mean(Test.Score)) %>%
full_join(B, by = 'Class')
str(E)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ Class : int 1 2
# $ Mean Score: num 94 86.5
# $ Students : chr "A,D" "B,C"
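For comparison, the same three steps (reshape to long, join, aggregate) can be sketched without dplyr. This is only an illustrative base R variant, not the answer's method; the intermediate names parts, long, joined, means and result are made up for the example.

```r
# Same values as tables A and B above (braces already stripped from B)
A <- data.frame(Students = c("A", "B", "C", "D"),
                Test.Score = c(100, 81, 92, 88),
                stringsAsFactors = FALSE)
B <- data.frame(Class = 1:2,
                Students = c("A,D", "B,C"),
                stringsAsFactors = FALSE)

# Step 1: long format, one row per (Class, student)
parts <- strsplit(B$Students, ",")
long <- data.frame(Class = rep(B$Class, lengths(parts)),
                   Students = unlist(parts),
                   stringsAsFactors = FALSE)

# Step 2: attach each student's score
joined <- merge(long, A, by = "Students")

# Step 3: mean score per class, then bring back the original Students column
means <- aggregate(Test.Score ~ Class, data = joined, FUN = mean)
result <- merge(B, means, by = "Class")
```

Here result keeps the column name Test.Score for the class mean; renaming it with names(result) is left out to keep the sketch short.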
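Answer 4 (score: 0)
Another solution with dplyr and tidyr. The separate_rows function splits a delimited character column into one row per value. data_frame is a function similar to data.frame, but it does not coerce character columns to factors (recent versions of the tibble package deprecate data_frame in favor of tibble).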
# Load packages
library(dplyr)
library(tidyr)
# Create example data frames
df1 <- data_frame(Students = c("A", "B", "C", "D"),
`Test Score` = c(100, 81, 92, 88))
df2 <- data_frame(Class = c(1, 2),
Students = c("{A,D}", "{B,C}"))
# Create the output
df3 <- df2 %>%
mutate(Students = gsub("\\{|\\}", "", Students)) %>%
separate_rows(Students) %>%
left_join(df1, by = "Students") %>%
group_by(Class) %>%
summarise(`Mean Score` = mean(`Test Score`)) %>%
right_join(df2, by = "Class") %>%
select(Class, Students, `Mean Score`)
df3
# A tibble: 2 × 3
# Class Students `Mean Score`
# <dbl> <chr> <dbl>
# 1 1 {A,D} 94.0
# 2 2 {B,C} 86.5
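To see what separate_rows contributes on its own, here is a small sketch of just that intermediate step. It assumes the braces have already been removed, passes sep = "," explicitly rather than relying on the default separator, and uses a plain data.frame called classes (a hypothetical name) in place of df2.

```r
library(tidyr)

# Hypothetical stand-in for df2 after the gsub step
classes <- data.frame(Class = c(1, 2),
                      Students = c("A,D", "B,C"),
                      stringsAsFactors = FALSE)

# Split the comma-delimited Students column into one row per student
long <- separate_rows(classes, Students, sep = ",")

long$Class      # 1 1 2 2
long$Students   # "A" "D" "B" "C"
```

Each class row is repeated once per student, which is what lets the later left_join match students to their scores one at a time.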