I have a data frame with a column of names that need to be standardized.
Here is an example:
PatientId<- c(1,1,1,2,2,2)
Visit_Date<- c("28/02/2014", "29/04/2014", "10/02/2014", "25/01/2014", "01/02/2014", "08/01/2014")
ClinicName<- c("A","A","A", "B","B","B")
PractitionerName<- c("Ahmad Mobin", "Amhad Mobin", "Ahmaad Mobin", "Hadley wickham", "Hadley Wuckham", "Hadley Wihcam")
# Build the data frame directly; cbind() would first coerce every column to character
example_df <- data.frame(PatientId, Visit_Date, ClinicName, PractitionerName,
                         stringsAsFactors = FALSE)
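This is the code I currently use to standardize the names, but I would like to know whether it can be written more cleanly:
library(dplyr)
library(stringr)
example_df1 <- example_df %>%
  filter(str_detect(PractitionerName, "Mobin") == TRUE) %>%
  filter(ClinicName == "A") %>%
  mutate(PractitionerName = "Ahmad Mobin")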
# Now adding those changes back to my main dataset `example_df`
temp_df <- example_df %>%
  anti_join(example_df1, by = c("PatientId", "Visit_Date"))
example_df <- rbind(example_df1, temp_df)
# Repeat the above process to standardize "Hadley Wickham"
example_df1 <- example_df %>%
  filter(str_detect(PractitionerName, "Hadley") == TRUE) %>%
  filter(ClinicName == "B") %>%
  mutate(PractitionerName = "Hadley Wickham")
# Now adding those changes back to my main dataset `example_df`
temp_df <- example_df %>%
  anti_join(example_df1, by = c("PatientId", "Visit_Date"))
example_df <- rbind(example_df1, temp_df)
Answer 0 (score: 1)
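Oh... I realize I did not read your question properly. I would do this task as follows with a single logical-index assignment; if you have many such tasks, you may want to wrap it in a function:
# Overwrite the name wherever it contains "Mobin" and the clinic is "A"
example_df$PractitionerName[grepl("Mobin", example_df$PractitionerName) & example_df$ClinicName == "A"] <- "Ahmad Mobin"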
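The suggestion to wrap the assignment in a function might look like the minimal sketch below. The helper name `standardize_name` and its parameters (`pattern`, `clinic`, `canonical`) are hypothetical and not part of the original answer; the sketch assumes the column names from the question and treats `pattern` as a literal substring.

```r
# Hypothetical helper (not from the original answer): replace every name that
# contains `pattern` within a given clinic by one canonical spelling.
standardize_name <- function(df, pattern, clinic, canonical) {
  # TRUE for rows whose name contains `pattern` and whose clinic matches
  hit <- grepl(pattern, df$PractitionerName, fixed = TRUE) &
    df$ClinicName == clinic
  df$PractitionerName[hit] <- canonical
  df
}

# Same data as in the question, reduced to the columns the helper needs
example_df <- data.frame(
  PatientId = c(1, 1, 1, 2, 2, 2),
  ClinicName = c("A", "A", "A", "B", "B", "B"),
  PractitionerName = c("Ahmad Mobin", "Amhad Mobin", "Ahmaad Mobin",
                       "Hadley wickham", "Hadley Wuckham", "Hadley Wihcam"),
  stringsAsFactors = FALSE
)

# One call per practitioner replaces each filter/anti_join/rbind round trip
example_df <- standardize_name(example_df, "Mobin", "A", "Ahmad Mobin")
example_df <- standardize_name(example_df, "Hadley", "B", "Hadley Wickham")
```

Because rows are modified in place, the original row order is kept, which the filter/anti_join/rbind approach does not guarantee.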
Answer 1 (score: 1)
Depending on the problem, you could also consider using string distance and mapping each name to the closest canonical spelling:
library(dplyr)
library(stringdist)
practitioners <- c("Ahmad Mobin", "Hadley Wickham")
example_df %>%
  mutate(PractitionerName =
    # which.min picks, for each row, the practitioner with the smallest distance
    practitioners[apply(stringdistmatrix(PractitionerName, practitioners), 1, which.min)])
  PatientId Visit_Date ClinicName PractitionerName
1         1 28/02/2014          A      Ahmad Mobin
2         1 29/04/2014          A      Ahmad Mobin
3         1 10/02/2014          A      Ahmad Mobin
4         2 25/01/2014          B   Hadley Wickham
5         2 01/02/2014          B   Hadley Wickham
6         2 08/01/2014          B   Hadley Wickham
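One caveat with picking the nearest name is that every input is mapped to some practitioner, even when it is not close to any of them. A rough sketch of a guard follows. It swaps in base R's `adist()` (generalized Levenshtein distance) for `stringdist` so that it needs no extra packages; the helper name `match_practitioner`, the cutoff `max_dist = 3`, and the "Jane Doe" input are assumptions made for illustration only.

```r
# Hypothetical helper: map each name to its nearest canonical spelling, but
# only when the edit distance is within `max_dist` (an assumed cutoff).
match_practitioner <- function(x, canonical, max_dist = 3) {
  # Distance matrix: one row per input name, one column per canonical name
  d <- adist(x, canonical, ignore.case = TRUE)
  best <- apply(d, 1, which.min)               # column index of nearest name
  best_dist <- d[cbind(seq_along(x), best)]    # distance to that nearest name
  # Keep the input unchanged when even the nearest name is too far away
  ifelse(best_dist <= max_dist, canonical[best], x)
}

practitioners <- c("Ahmad Mobin", "Hadley Wickham")
observed <- c("Amhad Mobin", "Hadley Wihcam", "Jane Doe")
result <- match_practitioner(observed, practitioners)
print(result)
```

With this cutoff the two misspellings are corrected while "Jane Doe" is left as it is; a suitable value for `max_dist` depends on how long and how similar the real names are.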