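I have a large data frame with a column containing thousands of different location (city) names, and I need to simplify/clean it up.
After struggling a great deal with regular expressions and loops, I found the DataCombine package and its FindReplace function, which is meant to do what I want, but I can't get it to work.
So I have: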
UserId Location
1 USR_1 Paris
2 USR_2 London
3 USR_3 Londres
4 USR_4 Neuilly
5 USR_5 Berlin
6 USR_6 London Chelsea
7 USR_7 Berlin Schoenfeld
8 USR_8 Paris-20
9 USR_9 Neuilly
10 USR_10 Friedrischain
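The cleaning is just substitution: for example, "London Chelsea" should become "London", "Brooklyn" should become "New York City", and "Paris 20e" and "Paris 14" should become "Paris". Going one step further, I would like every entry that contains the pattern "Paris" to be replaced by "Paris" (similar to "Paris%" in SQL).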
# Data for testing
library(DataCombine)
library(dplyr) # provides data_frame(); newer versions prefer tibble()
user_test <- data_frame(x <- paste("USR", as.character(1:10), sep = "_"), y <- c("Paris", "London", "Londres", "Neuilly", " Berlin", "London Chelsea", "Berlin Schoenfeld", "Paris-20", "Neuilly", "Friedrischain"))
colnames(user_test) <- c("UserId","Location")
user_test <- as.data.frame(user_test) ### Not sure why I have to put it there but otherwise it doesn't have the dataframe class
should_be <- data_frame(c("Paris", "London", "Berlin", "Neuilly", "Friedr"), c("Paris", "London", "Berlin", "Paris", "Berlin"))
colnames(should_be) <- c("is","should_be")
# Calling the function
FindReplace(data = user_test, Var = "Location", replaceData = should_be, from = "is", to = "should_be", exact = FALSE, vector = FALSE)
The function returns:
UserId Location
1 USR_1 Paris
2 USR_2 London
3 USR_3 Londres
4 USR_4 Paris
5 USR_5 Berlin
6 USR_6 London Chelsea
7 USR_7 Berlin Schoenfeld
8 USR_8 Paris-20
9 USR_9 Paris
10 USR_10 Berlinischain
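The entries are partially cleaned (the matching substring has been replaced), but not the whole entry.
Any ideas on how I could do this? A loop with grep? match? Or do I really have to build a clean-up data frame that lists absolutely every entry I want to fix?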
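Answer 0 (score: 0)
Use coalesce: join the lookup table onto the data, then fall back to the original Location wherever the join finds no match.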
# Exact-match join against the lookup table; coalesce() keeps Location where should_be is NA
library(tidyverse)
left_join(user_test, should_be, by = c("Location" = "is")) %>%
  mutate(final = coalesce(should_be, Location))
#> # A tibble: 10 x 4
#> UserId Location should_be final
#> <chr> <chr> <chr> <chr>
#> 1 USR_1 Paris Paris Paris
#> 2 USR_2 London London London
#> 3 USR_3 Londres <NA> Londres
#> 4 USR_4 Neuilly Paris Paris
#> 5 USR_5 " Berlin" <NA> " Berlin"
#> 6 USR_6 London Chelsea <NA> London Chelsea
#> 7 USR_7 Berlin Schoenfeld <NA> Berlin Schoenfeld
#> 8 USR_8 Paris-20 <NA> Paris-20
#> 9 USR_9 Neuilly Paris Paris
#> 10 USR_10 Friedrischain <NA> Friedrischain
Created on 2018-03-03 by the reprex package (v0.2.0).
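Note that the join above only catches exact matches, so entries such as "London Chelsea" or "Paris-20" stay unchanged. If the goal is to replace the whole entry whenever it contains one of the patterns, one possible sketch in base R is to loop over the lookup table with grepl(). The helper name clean_locations is hypothetical (it is not part of DataCombine or the tidyverse), and fixed = TRUE assumes the patterns are literal substrings and not regular expressions.

```r
# Sample data from the question, built as plain data frames (no packages needed)
user_test <- data.frame(
  UserId   = paste("USR", 1:10, sep = "_"),
  Location = c("Paris", "London", "Londres", "Neuilly", " Berlin",
               "London Chelsea", "Berlin Schoenfeld", "Paris-20",
               "Neuilly", "Friedrischain"),
  stringsAsFactors = FALSE
)
should_be <- data.frame(
  is        = c("Paris", "London", "Berlin", "Neuilly", "Friedr"),
  should_be = c("Paris", "London", "Berlin", "Paris", "Berlin"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the WHOLE entry when it contains a pattern.
# Matching is done against the original values, so an earlier replacement
# can never trigger a later pattern. If several patterns match the same
# entry, the last matching row of the lookup table wins.
clean_locations <- function(x, patterns, replacements) {
  out <- x
  for (i in seq_along(patterns)) {
    hit <- grepl(patterns[i], x, fixed = TRUE)  # literal substring match
    out[hit] <- replacements[i]
  }
  out
}

user_test$Location <- clean_locations(user_test$Location,
                                      should_be$is, should_be$should_be)
user_test
```

With this lookup table, "Londres" is left as it is because none of the patterns occurs in it; it would need its own row (for example "Londres" mapped to "London"). For a strict prefix match closer to SQL's "Paris%", startsWith(x, patterns[i]) could be used in place of grepl().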