Best way to merge two datasets (perhaps a function?)

Time: 2017-12-06 23:02:00

Tags: r

I am working with two datasets, TestA and TestB (the R code to create them is given below).

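I would like to merge the two datasets (if possible, without using merge()) so that all columns of TestA are filled with the information provided by TestB, matched on Class and Section.

I tried merge(TestA, TestB, by = c('Class', 'Section'), all.x = TRUE), but it adds observations to the original TestA. This is only a test; the dataset I actually work with has hundreds of observations. The approach works on these smaller frames, but something goes wrong with the larger set. That is why I would like to know whether there is an alternative to merge.

Any ideas on how to do this?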

The datasets can be created as follows:

Instructor <- c('Mr.A','Mr.A','Mr.B', 'Mr.C', 'Mr.D')
Class <- c('French','French','English', 'Math', 'Geometry')
Section <- c('1','2','3','5','5')
Time <- c('9:00-10:00','10:00-11:00','9:00-10:00','9:00-10:00','10:00-11:00')
Date <- c('MWF','MWF','TR','TR','MWF')
Enrollment <- c('30','40','24','29','40')

TestA <- data.frame(Instructor,Class,Section,Time,Date,Enrollment)

rm(Instructor,Class,Section,Time,Date,Enrollment)

Student <- c("Frances","Cass","Fern","Pat","Peter","Kory","Cole")
ID <- c('123','121','101','151','456','789','314')
Instructor <- c('','','','','','','')
Time <- c('','','','','','','')
Date <- c('','','','','','','')
Enrollment <- c('','','','','','','')
Class <- c('French','French','French','French','English', 'Math', 'Geometry')
Section <- c('1','1','2','2','3','5','5')


TestB <- data.frame(Student, ID, Instructor, Class, Section, Time, Date, Enrollment)

rm(Instructor,Class,Section,Time,Date,Enrollment,ID,Student)

2 answers:

Answer 0 (score: 2)

I used to be a big fan of merge() until I learned about the dplyr join functions.

Try this instead:

library(dplyr)

TestA %>%
  # Here, you're joining by just the "Class" and "Section" columns of TestA and TestB
  left_join(TestB, by = c("Class", "Section")) %>%
  select(Class, Section, Instructor = Instructor.x, Time = Time.x,
         Date = Date.x, Enrollment = Enrollment.x, Student, ID) %>%
  arrange(Class, Section) # Added to match your output.

The select statement keeps only the columns that are explicitly named and, in some cases, renames them.

Output:

     Class Section Instructor        Time Date Enrollment Student  ID
1  English       3       Mr.B  9:00-10:00   TR         24   Peter 456
2   French       1       Mr.A  9:00-10:00  MWF         30 Frances 123
3   French       1       Mr.A  9:00-10:00  MWF         30    Cass 121
4   French       2       Mr.A 10:00-11:00  MWF         40    Fern 101
5   French       2       Mr.A 10:00-11:00  MWF         40     Pat 151
6 Geometry       5       Mr.D 10:00-11:00  MWF         40    Cole 314
7     Math       5       Mr.C  9:00-10:00   TR         29    Kory 789
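If a join like this still returns more rows than expected on the larger dataset, one possible cause (an assumption, since the real data is not shown in the question) is that some Class/Section pairs occur more than once in the lookup table. The base R sketch below checks for that. The helper name find_dup_keys and the TestA_dup frame are hypothetical and exist only for illustration.

```r
# Key columns of TestA as defined in the question (character columns)
TestA <- data.frame(
  Instructor = c("Mr.A", "Mr.A", "Mr.B", "Mr.C", "Mr.D"),
  Class      = c("French", "French", "English", "Math", "Geometry"),
  Section    = c("1", "2", "3", "5", "5"),
  stringsAsFactors = FALSE
)

# Hypothetical: repeat the first row to mimic a table with a duplicated key
TestA_dup <- rbind(TestA, TestA[1, ])

# Hypothetical helper: return every row whose key combination is not unique
find_dup_keys <- function(df, keys = c("Class", "Section")) {
  k <- df[, keys, drop = FALSE]
  df[duplicated(k) | duplicated(k, fromLast = TRUE), , drop = FALSE]
}

dups_clean <- find_dup_keys(TestA)      # no duplicated keys in the sample data
dups_dirty <- find_dup_keys(TestA_dup)  # both copies of the French / 1 row
print(dups_dirty)
```

Any rows returned here would each match the same students during the join, which multiplies the number of rows in the result.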

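Answer 1 (score: 2)

The key is to remove the empty but duplicated columns from TestB before merging/joining, as shown by SymbolixAU.

Here is an implementation in data.table syntax: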

library(data.table)
setDT(TestB)[, .(Student, ID, Class, Section)][setDT(TestA), on = .(Class, Section)]

   Student  ID    Class Section Instructor        Time Date Enrollment
1: Frances 123   French       1       Mr.A  9:00-10:00  MWF         30
2:    Cass 121   French       1       Mr.A  9:00-10:00  MWF         30
3:    Fern 101   French       2       Mr.A 10:00-11:00  MWF         40
4:     Pat 151   French       2       Mr.A 10:00-11:00  MWF         40
5:   Peter 456  English       3       Mr.B  9:00-10:00   TR         24
6:    Kory 789     Math       5       Mr.C  9:00-10:00   TR         29
7:    Cole 314 Geometry       5       Mr.D 10:00-11:00  MWF         40
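Since the question asks for an approach that avoids merge(), the same lookup can also be sketched in base R with match(). This is only an illustration, not part of either answer. It assumes the key pairs in TestA are unique and that no Class or Section value contains the "|" separator. The names idx, fill_cols and filled are made up for the example.

```r
# Data from the question, stored as character columns
TestA <- data.frame(
  Instructor = c("Mr.A", "Mr.A", "Mr.B", "Mr.C", "Mr.D"),
  Class      = c("French", "French", "English", "Math", "Geometry"),
  Section    = c("1", "2", "3", "5", "5"),
  Time       = c("9:00-10:00", "10:00-11:00", "9:00-10:00", "9:00-10:00", "10:00-11:00"),
  Date       = c("MWF", "MWF", "TR", "TR", "MWF"),
  Enrollment = c("30", "40", "24", "29", "40"),
  stringsAsFactors = FALSE
)
TestB <- data.frame(
  Student    = c("Frances", "Cass", "Fern", "Pat", "Peter", "Kory", "Cole"),
  ID         = c("123", "121", "101", "151", "456", "789", "314"),
  Instructor = "",
  Class      = c("French", "French", "French", "French", "English", "Math", "Geometry"),
  Section    = c("1", "1", "2", "2", "3", "5", "5"),
  Time       = "",
  Date       = "",
  Enrollment = "",
  stringsAsFactors = FALSE
)

# For each student row, find the position of its Class/Section pair in TestA
idx <- match(paste(TestB$Class, TestB$Section, sep = "|"),
             paste(TestA$Class, TestA$Section, sep = "|"))

# Overwrite the empty columns of TestB with the matching TestA values
fill_cols <- c("Instructor", "Time", "Date", "Enrollment")
filled <- TestB
filled[fill_cols] <- TestA[idx, fill_cols]
print(filled)
```

Because match() returns only the first matching position, the result always has exactly as many rows as TestB. Students whose Class/Section pair is missing from TestA would get NA in the filled columns.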