library(stringr)
cat.keys <- list(Internship='Intern',
Information.Technology=c('IT', 'Information Technology', 'Software', 'Developer'),
Healthcare=c('RN', 'LPN', 'Doctor', 'Nurse'),
Maintenance=c('Custodian', 'Janitor'))
jobs.df <- data.frame(Company=c('Big Brothers Big Sisters', 'Big Brothers Big Sisters',
'Big Brothers Big Sisters', 'American Red Cross', 'American Red Cross',
'American Red Cross', 'DeMolay International', 'Legal Aid Association',
'St.Mary’s Church'),
Job.Title = c('Intern', 'Marketing Intern', 'Special Events Internship Program',
'RN', 'Nurse', 'Registered Nurse', 'Director of IT - DeMolay International',
'SWITCHBOARD/INTAKE SPECIALIST', 'CHURCH CUSTODIAN - part-time'))
lapply(jobs.df$Job.Title,
function(x) sapply(cat.keys, function(y) str_detect(x, fixed(y))))
Answer 0 (score: 0)
I think this does the trick:
library(tidyverse)
jobs.df %>% rowwise %>% mutate(
cat = map(cat.keys, ~str_detect(Job.Title, fixed(.x)) %>% any) %>%
{names(.[.==TRUE])} %>% paste(collapse=" - ")
)
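As you can see, 'Director of IT - DeMolay International' also contains 'Intern' (inside 'International'), so it ends up in two categories.
You may want to reconsider the wording in your list. For example, using Intern$ in the list and regex(.x) in the code gives different results, which may also be interesting (better in some cases, worse in others). That takes some regex work, though.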
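A minimal sketch of the regex idea, not the answerer's code: instead of anchoring with Intern$, it wraps each category's keywords in word boundaries (\b) and ignores case. Both choices are assumptions made for illustration, and to_pattern and match_categories are hypothetical helper names.

```r
library(stringr)

# Keyword list from the question
cat.keys <- list(Internship = 'Intern',
                 Information.Technology = c('IT', 'Information Technology',
                                            'Software', 'Developer'),
                 Healthcare = c('RN', 'LPN', 'Doctor', 'Nurse'),
                 Maintenance = c('Custodian', 'Janitor'))

titles <- c('Marketing Intern',
            'Special Events Internship Program',
            'Director of IT - DeMolay International',
            'SWITCHBOARD/INTAKE SPECIALIST',
            'CHURCH CUSTODIAN - part-time')

# Hypothetical helper: one whole-word, case-insensitive regex per category,
# e.g. "\\b(RN|LPN|Doctor|Nurse)\\b"
to_pattern <- function(keys) {
  regex(paste0('\\b(', paste(keys, collapse = '|'), ')\\b'), ignore_case = TRUE)
}

# Hypothetical helper: returns one " - "-separated category string per title
match_categories <- function(titles, keys) {
  # Logical matrix: one row per title, one column per category
  hits <- do.call(cbind, lapply(keys, function(k) str_detect(titles, to_pattern(k))))
  apply(hits, 1, function(row) paste(colnames(hits)[row], collapse = ' - '))
}

result <- match_categories(titles, cat.keys)
```

Under these assumptions 'International' no longer counts as 'Intern' and 'SWITCHBOARD' no longer counts as 'IT', and 'CUSTODIAN' is picked up despite the upper case. The trade-off is that 'Internship' is no longer matched by 'Intern' either, so the keyword list would need an explicit 'Internship' entry.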
Edit:
Using rowwise is indeed quite slow. You can use the faster apply over rows (MARGIN = 1) like this:
jobs.df$cat2=jobs.df %>% apply(1, function(x){
map(cat.keys, ~str_detect(x["Job.Title"], fixed(.x)) %>% any) %>%
{names(.[.==TRUE])} %>% paste(collapse=" - ")
})
Answer 1 (score: 0)
Use stack to convert the named list to a data frame, then merge:
merge(jobs.df, stack(cat.keys), by.x = 'Job.Title', by.y = 'values')
# Job.Title Company ind
#1 Intern Big Brothers Big Sisters Internship
#2 Nurse American Red Cross Healthcare
#3 RN American Red Cross Healthcare
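To make the intermediate step visible, here is a small sketch of what stack produces for this list. The name key.df is only for illustration. Because merge joins on equality, only titles that are exactly equal to a keyword survive, which is why the output above has three rows.

```r
# Keyword list from the question
cat.keys <- list(Internship = 'Intern',
                 Information.Technology = c('IT', 'Information Technology',
                                            'Software', 'Developer'),
                 Healthcare = c('RN', 'LPN', 'Doctor', 'Nurse'),
                 Maintenance = c('Custodian', 'Janitor'))

# One row per keyword: 'values' holds the keyword, 'ind' the category name
key.df <- stack(cat.keys)
str(key.df)

# merge() matches on equality, so a title containing a keyword is not enough
exact.hit   <- 'Nurse' %in% key.df$values             # TRUE
partial.hit <- 'Registered Nurse' %in% key.df$values  # FALSE
```

If partial matches are needed, a substring or regex test against the title is required instead of an equality join.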