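Swapping misplaced cells in R?

Time: 2019-01-29 09:24:54

Tags: r

I have a huge database (more than 65M rows), and I noticed that some cells are misplaced. For example, suppose I have this: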

library("tidyverse")

DATA <- tribble(
  ~SURNAME,~NAME,~STATE,~COUNTRY,
  'Smith','Emma','California','USA',
  'Johnson','Oliia','Texas','USA',
  'Williams','James','USA','California',
  'Jones','Noah','Pennsylvania','USA',
  'Williams','Liam','Illinois','USA',
  'Brown','Sophia','USA','Louisiana',
  'Daves','Evelyn','USA','Oregon',
  'Miller','Jacob','New Mexico','USA',
  'Williams','Lucas','Connecticut','USA',
  'Daves','John','California','USA',
  'Jones','Carl','USA','Illinois'
)

=====

> DATA
# A tibble: 11 x 4
   SURNAME  NAME   STATE        COUNTRY   
   <chr>    <chr>  <chr>        <chr>     
 1 Smith    Emma   California   USA       
 2 Johnson  Oliia  Texas        USA       
 3 Williams James  USA          California
 4 Jones    Noah   Pennsylvania USA       
 5 Williams Liam   Illinois     USA       
 6 Brown    Sophia USA          Louisiana 
 7 Daves    Evelyn USA          Oregon    
 8 Miller   Jacob  New Mexico   USA       
 9 Williams Lucas  Connecticut  USA       
10 Daves    John   California   USA       
11 Jones    Carl   USA          Illinois 

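As you can see, COUNTRY and STATE are swapped in some rows. How can I swap them back efficiently?

Kind regards, Luis.

3 answers:

Answer 0 (score: 2)

Using data.table and the built-in state.name vector: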

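library(data.table)

setDT(DATA)  # convert the tibble to a data.table by reference
# Where COUNTRY holds a US state name, swap the two columns in place
DATA[COUNTRY %in% state.name, `:=`(COUNTRY = STATE, STATE = COUNTRY)]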

DATA
#      SURNAME   NAME        STATE COUNTRY
#  1:    Smith   Emma   California     USA
#  2:  Johnson  Oliia        Texas     USA
#  3: Williams  James   California     USA
#  4:    Jones   Noah Pennsylvania     USA
#  5: Williams   Liam     Illinois     USA
#  6:    Brown Sophia    Louisiana     USA
#  7:    Daves Evelyn       Oregon     USA
#  8:   Miller  Jacob   New Mexico     USA
#  9: Williams  Lucas  Connecticut     USA
# 10:    Daves   John   California     USA
# 11:    Jones   Carl     Illinois     USA
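
For readers without data.table, the same idea (detect rows whose COUNTRY is really a state name, then swap the two columns on those rows only) can be sketched in base R. This is a minimal sketch on a three-row toy subset of the question's data, not a benchmark for 65M rows; the variable name `misplaced` is made up for illustration.

```r
# Toy subset of the question's data; rows 2 and 3 are misplaced
DATA <- data.frame(
  SURNAME = c("Smith", "Williams", "Brown"),
  NAME    = c("Emma", "James", "Sophia"),
  STATE   = c("California", "USA", "USA"),
  COUNTRY = c("USA", "California", "Louisiana"),
  stringsAsFactors = FALSE
)

# Rows whose COUNTRY value is actually a US state name
# (state.name ships with base R in the datasets package)
misplaced <- DATA$COUNTRY %in% state.name

# Assignment into a data frame is positional, so listing the columns
# in reversed order on the right-hand side swaps them on those rows
DATA[misplaced, c("STATE", "COUNTRY")] <- DATA[misplaced, c("COUNTRY", "STATE")]

DATA
```

Like the data.table version, this only recognises the 50 state names in state.name, so other values in COUNTRY are left untouched.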

Answer 1 (score: 1)

Check this solution (assuming the COUNTRY column uses the ISO3 format, e.g. MEX, CAN):

# '^[A-Z]{3}$' matches values that are exactly three capital letters (ISO3-style)
DATA %>%
  mutate(
    COUNTRY_TMP = if_else(str_detect(COUNTRY, '^[A-Z]{3}$'), COUNTRY, STATE),
    STATE = if_else(str_detect(COUNTRY, '^[A-Z]{3}$'), STATE, COUNTRY),
    COUNTRY = COUNTRY_TMP
  ) %>%
  select(-COUNTRY_TMP)
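
Since this answer hinges on a pattern test for ISO3-style codes, it may help to see how an anchored pattern differs from an unanchored one. The sketch below uses base R's grepl on a few hypothetical sample values (they are not taken from the real data); an unanchored pattern accepts any value that merely contains three capitals in a row.

```r
# Hypothetical sample values, not taken from the real data
values <- c("USA", "MEX", "California", "NASA")

# Anchored: the whole value must be exactly three capital letters
iso3_like <- grepl("^[A-Z]{3}$", values)

# Unanchored: a run of three capitals anywhere in the value is enough
loose <- grepl("[A-Z]{3}", values)

data.frame(values, iso3_like, loose)
```

Here "NASA" passes the unanchored test but not the anchored one, which is the kind of false positive the anchors guard against.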

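Answer 2 (score: 0)

Assuming all country names follow the ISO3 format, we can first install the countrycode package. This package contains a data frame named codelist, whose iso3c column holds the ISO3 country codes. We can swap the misplaced values as follows.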

library(tidyverse)
library(countrycode)

DATA2 <- DATA %>%
  mutate(STATE2 = ifelse(STATE %in% codelist$iso3c & 
                           !COUNTRY %in% codelist$iso3c, COUNTRY, STATE),
         COUNTRY2 = ifelse(!STATE %in% codelist$iso3c & 
                             COUNTRY %in% codelist$iso3c, COUNTRY, STATE)) %>%
  select(-STATE, -COUNTRY) %>%
  rename(STATE = STATE2, COUNTRY = COUNTRY2)

DATA2
# # A tibble: 11 x 4
#    SURNAME  NAME   STATE        COUNTRY
#    <chr>    <chr>  <chr>        <chr>  
#  1 Smith    Emma   California   USA    
#  2 Johnson  Oliia  Texas        USA    
#  3 Williams James  California   USA    
#  4 Jones    Noah   Pennsylvania USA    
#  5 Williams Liam   Illinois     USA    
#  6 Brown    Sophia Louisiana    USA    
#  7 Daves    Evelyn Oregon       USA    
#  8 Miller   Jacob  New Mexico   USA    
#  9 Williams Lucas  Connecticut  USA    
# 10 Daves    John   California   USA    
# 11 Jones    Carl   Illinois     USA
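
With 65M rows it may be worth counting how many rows fall into each case before overwriting anything. The following is a rough sketch in base R, assuming a small hand-written vector `known_countries` as a stand-in for codelist$iso3c and a made-up four-row data frame; the labels "ok", "swapped" and "ambiguous" are illustrative names, not part of any package.

```r
# Assumption: a short stand-in for codelist$iso3c
known_countries <- c("USA", "MEX", "CAN")

# Made-up rows: one correct, one swapped, one correct, one with no country code at all
DATA <- data.frame(
  STATE   = c("California", "USA", "Texas", "Unknown"),
  COUNTRY = c("USA", "California", "USA", "Oregon"),
  stringsAsFactors = FALSE
)

state_is_country   <- DATA$STATE %in% known_countries
country_is_country <- DATA$COUNTRY %in% known_countries

# Classify each row; anything that is neither clearly fine nor clearly
# swapped is flagged for manual inspection instead of being modified
status <- ifelse(country_is_country & !state_is_country, "ok",
          ifelse(state_is_country & !country_is_country, "swapped", "ambiguous"))

table(status)
```

Rows labelled "ambiguous" (both columns or neither column holding a country code) are the ones a blind swap could get wrong, so they are better reviewed separately.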