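I recently posted: dplyr, lapply, or Map to identify information from one data.frame and place it into another
My main issue is using dplyr / lapply to combine two data.frames by a column of strings. The strings are names, but they are not always exactly the same in both data.frames. I would like 'Jon' in df1 to match 'Jonathan' in df2, or 'Carol' in df1 to match 'Caroline' in df2.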
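To make the kind of match described above concrete, here is a minimal base R sketch. The vectors `short_names` and `full_names` are hypothetical stand-ins, not part of the real data. Note that `startsWith` only checks prefixes, while a substring test such as `grepl(..., fixed = TRUE)` (or `str_detect`) also matches in the middle of a name.

```r
# Hypothetical example vectors, paired element by element
short_names <- c("Jon", "Carol", "Name")
full_names  <- c("Jonathan", "Caroline", "First")

# Prefix match: does each full name start with the corresponding short name?
prefix_hit <- startsWith(full_names, short_names)
print(prefix_hit)  # TRUE TRUE FALSE

# Substring match anywhere in the string, treating the pattern literally
substring_hit <- grepl("Carol", "Caroline", fixed = TRUE)
print(substring_hit)  # TRUE
```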
# Below data.frame represents a data.frame with ~30000 rows
Test.Takers <- data.frame(
  Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', "Father's Name"),
  Maternal = c('Maternal', 'Maternal', 'Last', 'Maternal', 'Last', "Mother's Name"),
  First = c('Carol', 'Name', 'First', 'Name', 'First', 'Jon'),
  id_num = NA,
  stringsAsFactors = FALSE)
# Below data.frame represents a data.frame with ~12000000 rows
Every.Student.In.The.Country <- data.frame(
  Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', 'Paternal', "Father's Name"),
  Maternal = c('Maternal', 'Last', 'Last', 'Maternal', 'Last', 'Maternal', "Mother's Name"),
  First = c('Caroline', 'Name', 'First', 'Name', 'First', 'Something Else', 'Jonathan'),
  id_num = c(123, 456, 789, 234, 567, 890, 101),
  stringsAsFactors = FALSE)
I came up with an lapply function that uses str_detect, but it is very slow:
library(dplyr)
library(stringr)

matching_name_one_row <- function(student_df) {
  # Filter the massive student file down to rows where both last names match exactly
  indexmp <- Every.Student.In.The.Country %>%
    filter(Paternal == as.character(student_df$Paternal),
           Maternal == as.character(student_df$Maternal))
  # Use str_detect to identify any potential first-name matches within the filtered rows
  id_num <- indexmp$id_num[str_detect(indexmp$First, as.character(student_df$First))]
  # Return only the first match from str_detect (NA if there is none)
  return(id_num[1])
}
# Create a list of individual rows to apply the function to
rowlist <- lapply(seq_len(nrow(Test.Takers)), function(i) Test.Takers[i, ])
# Use lapply on the list of individual rows
Test.Takers$id_num <- unlist(lapply(rowlist, matching_name_one_row))
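dplyr has two-table verbs such as left_join that work with large data.frames and combine information. However, I don't know how to add a function such as str_detect or pmatch to a function like left_join.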
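One way the idea of combining left_join with str_detect might be sketched: join on the two last-name columns, which must match exactly, and then apply the partial first-name test to the joined rows instead of inside the join. This is only a sketch under assumptions, not a tested solution for the full-size data. The names `test_takers`, `all_students`, `row_id`, `hit` and `matched` are hypothetical, the sample rows are cut down from the data above, and `fixed()` is used so that first names are compared literally, not as regular expressions.

```r
library(dplyr)
library(stringr)

# Small hypothetical stand-ins for the two data.frames in the question
test_takers <- data.frame(
  Paternal = c("Last", "Paternal", "Father's Name"),
  Maternal = c("Maternal", "Last", "Mother's Name"),
  First    = c("Carol", "First", "Jon"),
  stringsAsFactors = FALSE
)
all_students <- data.frame(
  Paternal = c("Last", "Paternal", "Paternal", "Father's Name"),
  Maternal = c("Maternal", "Last", "Maternal", "Mother's Name"),
  First    = c("Caroline", "First", "Something Else", "Jonathan"),
  id_num   = c(123, 567, 890, 101),
  stringsAsFactors = FALSE
)

matched <- test_takers %>%
  # Remember the original row so each test taker ends up with exactly one result
  mutate(row_id = row_number()) %>%
  # Exact join on both last names; the student's first name becomes First.full
  left_join(all_students, by = c("Paternal", "Maternal"), suffix = c("", ".full")) %>%
  # Partial first-name test, vectorised over the joined rows
  mutate(hit = !is.na(First.full) & str_detect(First.full, fixed(First))) %>%
  group_by(row_id) %>%
  # Keep the first matching id per test taker (NA when nothing matches)
  summarise(
    Paternal = first(Paternal),
    Maternal = first(Maternal),
    First    = first(First),
    id_num   = id_num[hit][1],
    .groups  = "drop"
  )

print(matched)
```

On the real data several students can share both last names, so the join may produce many candidate rows per test taker, and recent dplyr versions may warn about a many-to-many relationship. Whether this is actually faster than the row-by-row lapply would need to be measured.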