dplyr left_join on a string column that is similar but not identical (pmatch or str_detect)

Date: 2016-06-30 19:43:01

Tags: r dplyr lapply bigdata

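I recently posted: dplyr, lapply, or Map to identify information from one data.frame and place it into another

My main problem is using dplyr / lapply to combine two data.frames on a column of strings. The strings are first names, but they are not always exactly the same in the two data.frames.

That is, I would like 'Jon' in df1 to match 'Jonathan' in df2, and 'Carol' in df1 to match 'Caroline' in df2.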
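To make the intended matching concrete, here is a minimal sketch on two bare vectors. The names `short` and `long` are hypothetical and used only for illustration; `startsWith` and `pmatch` are base R.

```r
# Hypothetical stand-ins for the first names in df1 and df2.
short <- c("Jon", "Carol")
long  <- c("Jonathan", "Caroline")

# Element-wise prefix test: does long[i] begin with short[i]?
prefix_match <- startsWith(long, short)

# Partial matching: for each short name, the position of the unique
# entry in `long` that it begins (NA if none or ambiguous).
pmatch_index <- pmatch(short, long)

print(prefix_match)
print(pmatch_index)
```

Note that `pmatch` returns NA when a short name is a prefix of more than one candidate, so it only helps when the candidates have already been narrowed down, for example by surname.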

#Below data.frame represents a data.frame with ~30000 rows
Test.Takers <- data.frame(
    Paternal = c('Last', 'Last','Last', 'Paternal', 'Paternal', "Father's Name"),
    Maternal = c('Maternal', 'Maternal', 'Last', 'Maternal', 'Last', "Mother's Name"),
    First = c('Carol', 'Name', 'First', 'Name', 'First', 'Jon'),
    id_num = NA,
    stringsAsFactors = FALSE)

#Below data.frame represents data.frame with ~12000000 rows
Every.Student.In.The.Country <- data.frame(
    Paternal = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', 'Paternal', "Father's Name"),
    Maternal = c('Maternal', 'Last', 'Last', 'Maternal', 'Last', 'Maternal', "Mother's Name"),
    First = c('Caroline', 'Name', 'First', 'Name', 'First', 'Something Else', 'Jonathan'),
    id_num = c(123, 456, 789, 234, 567, 890, 101),
    stringsAsFactors = FALSE)

I came up with an lapply function that uses str_detect, but it is very slow:

#Load packages once, outside the function, rather than on every call
library(dplyr)
library(stringr)

matching_name_one_row <- function(student_df) {

    #Filter through massive file with student information by both last names
    indexmp <- Every.Student.In.The.Country %>% filter(Paternal == as.character(student_df$Paternal), Maternal == as.character(student_df$Maternal))

    #Use str_detect to identify any potential first name matches in filter
    id_num <- indexmp$id_num[str_detect(indexmp$First, as.character(student_df$First))]

    #Just return first match from str_detect 
    return(id_num[1])
}

#Create a list of individual rows to use function on
rowlist <- lapply(seq_len(nrow(Test.Takers)), function(i) Test.Takers[i, ])

#Use lapply on list of individual rows
Test.Takers$id_num <- unlist(lapply(rowlist, matching_name_one_row))

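dplyr has two-table verbs, such as left_join, for working with large data.frames and combining information from them. However, I don't know how to add a function like str_detect or pmatch to a function like left_join.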
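One possible direction, offered only as a sketch and not as a tested solution for the full-size data: join exactly on the two surname columns first, then apply str_detect to the paired first-name columns. The tables `takers` and `students` below are small hypothetical stand-ins for Test.Takers and Every.Student.In.The.Country, and `row_id` is a helper column introduced for illustration.

```r
library(dplyr)
library(stringr)

# Small hypothetical stand-ins for the two tables in the question.
takers <- data.frame(
    Paternal = c('Last', "Father's Name"),
    Maternal = c('Maternal', "Mother's Name"),
    First = c('Carol', 'Jon'),
    stringsAsFactors = FALSE)

students <- data.frame(
    Paternal = c('Last', 'Last', "Father's Name"),
    Maternal = c('Maternal', 'Last', "Mother's Name"),
    First = c('Caroline', 'Name', 'Jonathan'),
    id_num = c(123, 456, 101),
    stringsAsFactors = FALSE)

# Step 1: exact join on both surnames. The shared column First is
# suffixed: First.x is the test taker, First.y the candidate.
candidates <- takers %>%
    mutate(row_id = row_number()) %>%
    inner_join(students, by = c('Paternal', 'Maternal'))

# Step 2: keep candidates whose first name contains the test taker's
# (treated as a regex, as in the lapply version), first hit per person.
matched <- candidates %>%
    filter(str_detect(First.y, First.x)) %>%
    group_by(row_id) %>%
    slice(1) %>%
    ungroup() %>%
    select(row_id, id_num)

# Step 3: left_join the ids back so unmatched test takers keep NA.
result <- takers %>%
    mutate(row_id = row_number()) %>%
    left_join(matched, by = 'row_id') %>%
    select(-row_id)

print(result)
```

The intermediate `candidates` table holds one row per surname match, so with common surnames it could grow large; whether this is faster than the row-by-row lapply on ~30000 by ~12000000 rows is an assumption that would need to be measured. Recent dplyr versions may also warn about a many-to-many relationship when surname pairs repeat on both sides.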

0 answers:

No answers