How to match incorrect strings and replace them with the correct strings

Time: 2020-05-17 14:45:04

Tags: r if-statement match

I have two data frames.

One contains pairs of correct and incorrect place names:

place  <- data.frame(
  place_correct = c("London", "Birmingham", "Newcastle", "Brighton"),
  place_incorrect = c("Lundn", "Birmgham", "Nexcassle", "Briton"), stringsAsFactors = F)

The other has a column that contains a mix of these correct and incorrect place names:

set.seed(123)
df <- data.frame(town = sample(c("London", "Birmingham", "Newcastle", "Brighton", 
                                 "Lundn", "Birmgham", "Nexcassle", "Briton"), 20, replace = T), stringsAsFactors = F)

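What I want to do is match the incorrect place names in df against the incorrect place names in place, and replace them with the correct place names.

EDIT

I can do this in base R with ifelse and %in%: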

df$town_correct <- ifelse(df$town %in% place$place_incorrect, 
                          place$place_correct[match(df$town, place$place_incorrect)], 
                          df$town)
df
         town town_correct
1   Newcastle    Newcastle
2   Nexcassle    Newcastle
3    Brighton     Brighton
4      Briton     Brighton
5      Briton     Brighton
6      London       London
7       Lundn       London
8      Briton     Brighton
9       Lundn       London
10   Brighton     Brighton
11     Briton     Brighton
12   Brighton     Brighton
13   Birmgham   Birmingham
14      Lundn       London
15     London       London
16     Briton     Brighton
17 Birmingham   Birmingham
18     London       London
19  Newcastle    Newcastle
20     Briton     Brighton

But how can this be done in dplyr?

3 answers:

Answer 0: (score: 2)

I would use this multisub function:

place  <- data.frame(
    place_correct = c("London", "Birmingham", "Newcastle", "Brighton"),
    place_incorrect = c("Lundn", "Birmgham", "Nexcassle", "Briton"), stringsAsFactors = F)

set.seed(123)
df <- data.frame(town = sample(c("London", "Birmingham", "Newcastle", "Brighton", 
                                 "Lundn", "Birmgham", "Nexcassle", "Briton"), 20, replace = T), stringsAsFactors = F)


multisub <- function(target, output, string) {
    replacement.list <- apply(cbind(target, output), 1, as.list)
    mygsub <- function(l, x) gsub(pattern = l[1], replacement = l[2], x, perl=TRUE)
    Reduce(mygsub, replacement.list, init = string, right = TRUE)
}


df$town_correct <- with(place, multisub(place_incorrect, place_correct, df$town))
df
#>          town town_correct
#> 1   Nexcassle    Newcastle
#> 2   Nexcassle    Newcastle
#> 3   Newcastle    Newcastle
#> 4    Birmgham   Birmingham
#> 5   Newcastle    Newcastle
#> 6  Birmingham   Birmingham
#> 7  Birmingham   Birmingham
#> 8    Birmgham   Birmingham
#> 9   Newcastle    Newcastle
#> 10      Lundn       London
#> 11   Brighton     Brighton
#> 12   Birmgham   Birmingham
#> 13   Birmgham   Birmingham
#> 14     London       London
#> 15 Birmingham   Birmingham
#> 16  Newcastle    Newcastle
#> 17     Briton     Brighton
#> 18      Lundn       London
#> 19  Newcastle    Newcastle
#> 20  Newcastle    Newcastle

Created on 2020-05-17 by the reprex package (v0.3.0)
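One caveat worth illustrating: gsub() replaces substrings, so a misspelling that occurs inside a longer value would also be rewritten. The sketch below assumes whole-string matching is wanted and anchors each pattern with ^ and $. The data (including "Britons Arms") is hypothetical and only there to show the difference; misspellings containing regex metacharacters would additionally need escaping.

```r
# Hypothetical example data, not taken from the question
place <- data.frame(
  place_correct   = c("London", "Brighton"),
  place_incorrect = c("Lundn", "Briton"),
  stringsAsFactors = FALSE
)
town <- c("Lundn", "Briton", "Britons Arms", "London")

# Anchor each pattern so only values that match in full are replaced
anchored <- paste0("^", place$place_incorrect, "$")

fixed <- town
for (i in seq_along(anchored)) {
  fixed <- gsub(anchored[i], place$place_correct[i], fixed, perl = TRUE)
}

# Unanchored replacement for comparison: "Britons Arms" is altered too
unanchored <- gsub("Briton", "Brighton", town, perl = TRUE)

fixed
unanchored
```

With the anchored patterns "Britons Arms" is left alone, whereas the unanchored call turns it into "Brightons Arms".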

Edit:

This may not be the most efficient solution, but here is one that uses ifelse after checking for a match:

df$town_correct <- vapply(df$town, function(x) ifelse(x %in% place$place_incorrect, 
place[match(x, place$place_incorrect, nomatch=0), "place_correct"], x), 
FUN.VALUE = NA_character_, USE.NAMES = FALSE)
df
#>          town town_correct
#> 1   Nexcassle    Newcastle
#> 2   Nexcassle    Newcastle
#> 3   Newcastle    Newcastle
#> 4    Birmgham   Birmingham
#> 5   Newcastle    Newcastle
#> 6  Birmingham   Birmingham
#> 7  Birmingham   Birmingham
#> 8    Birmgham   Birmingham
#> 9   Newcastle    Newcastle
#> 10      Lundn       London
#> 11   Brighton     Brighton
#> 12   Birmgham   Birmingham
#> 13   Birmgham   Birmingham
#> 14     London       London
#> 15 Birmingham   Birmingham
#> 16  Newcastle    Newcastle
#> 17     Briton     Brighton
#> 18      Lundn       London
#> 19  Newcastle    Newcastle
#> 20  Newcastle    Newcastle
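Since match() is already vectorised, the per-element vapply() loop can probably be avoided. The following is a minimal sketch of that idea, not the answerer's code; the small town vector (including "Leeds") is hypothetical example data.

```r
place <- data.frame(
  place_correct   = c("London", "Birmingham", "Newcastle", "Brighton"),
  place_incorrect = c("Lundn", "Birmgham", "Nexcassle", "Briton"),
  stringsAsFactors = FALSE
)

# Hypothetical input; "Leeds" is not in the lookup table at all
town <- c("Lundn", "London", "Nexcassle", "Leeds")

# Position of each town in the misspelling column, NA when it is not a known misspelling
idx <- match(town, place$place_incorrect)
hit <- !is.na(idx)

# Start from the original values and overwrite only the matched ones
town_correct <- town
town_correct[hit] <- place$place_correct[idx[hit]]
town_correct
```

Values that are not listed as misspellings, whether already correct or unknown, pass through unchanged.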

Answer 1: (score: 2)

The same ifelse() logic you used in base R also works in dplyr, here written with dplyr's if_else():

library(dplyr)

df %>%
  mutate(correct_town = if_else(town %in% place$place_incorrect, 
                            place$place_correct[match(town, place$place_incorrect)], 
                            town))

         town correct_town
1   Nexcassle    Newcastle
2   Nexcassle    Newcastle
3   Newcastle    Newcastle
4    Birmgham   Birmingham
5   Newcastle    Newcastle
6  Birmingham   Birmingham
7  Birmingham   Birmingham
8    Birmgham   Birmingham
9   Newcastle    Newcastle
10      Lundn       London
11   Brighton     Brighton
12   Birmgham   Birmingham
13   Birmgham   Birmingham
14     London       London
15 Birmingham   Birmingham
16  Newcastle    Newcastle
17     Briton     Brighton
18      Lundn       London
19  Newcastle    Newcastle
20  Newcastle    Newcastle

Alternatively, using stringr::str_replace_all():

df %>%
  mutate(correct_town = stringr::str_replace_all(town, setNames(place$place_correct, place$place_incorrect)))
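To make the named-vector form easier to follow: str_replace_all() accepts a named character vector in which the names are the patterns and the values are the replacements, and setNames() builds exactly that from the two columns. A small sketch on a cut-down, hypothetical version of the data:

```r
library(stringr)

place <- data.frame(
  place_correct   = c("London", "Brighton"),
  place_incorrect = c("Lundn", "Briton"),
  stringsAsFactors = FALSE
)

# Names are the patterns to find, values are what to put in their place
lookup <- setNames(place$place_correct, place$place_incorrect)
lookup

town <- c("Lundn", "Brighton", "Briton")
fixed <- str_replace_all(town, lookup)
fixed
```

The names are treated as regular expressions, so this form matches anywhere inside a string rather than only on whole values.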

Answer 2: (score: 1)

In this case you can use left_join from the dplyr package. You can use the following code:

    library(dplyr)

    # Join on the misspelling; rows without a match get NA in place_correct
    df <- left_join(df, place, by = c("town" = "place_incorrect"))
    df$Town_correct <- ifelse(is.na(df$place_correct), df$town, df$place_correct)
    df$place_correct <- NULL
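The same join can also be written as a single dplyr pipeline. This is a sketch rather than part of the original answer: it assumes dplyr's coalesce() is acceptable for filling the unmatched rows, and it uses a three-row hypothetical df so the result is easy to check.

```r
library(dplyr)

place <- data.frame(
  place_correct   = c("London", "Birmingham", "Newcastle", "Brighton"),
  place_incorrect = c("Lundn", "Birmgham", "Nexcassle", "Briton"),
  stringsAsFactors = FALSE
)

# Hypothetical, shortened version of df
df <- data.frame(town = c("Lundn", "London", "Briton"), stringsAsFactors = FALSE)

result <- df %>%
  # Rows whose town is not a known misspelling get NA in place_correct
  left_join(place, by = c("town" = "place_incorrect")) %>%
  # Take the corrected name when there is one, otherwise keep the original
  mutate(town_correct = coalesce(place_correct, town)) %>%
  select(-place_correct)

result
```

If place_incorrect contained duplicate entries, the join would duplicate the matching rows of df, so the lookup table should hold each misspelling only once.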
