Split a column in a data frame into multiple columns (of varying length) using a delimiter

Asked: 2018-05-08 21:30:41

Tags: r dataframe

I have this table:

cca2    ccn3    cca3    borders
AX      248     ALA 
AL      8       ALB     MNE,GRC,MKD,UNK
AD      20      AND     FRA,ESP
AT      40      AUT     CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE
BE      56      BEL     FRA,DEU,LUX,NLD

and would like to split borders into multiple columns. As you can see, the number of borders differs from row to row.

I tried:

newCountries <- data.frame(do.call('rbind', strsplit(as.character(countries$borders),',',fixed=TRUE)))

But it doesn't work well... How can I solve this?

I would like the result to look like this:

cca2    ccn3    cca3    b1   b2   b3  b4  b5  b6  b7  b8
AX      248     ALA     NA   NA   NA  NA  NA  NA  NA  NA
AL      8       ALB     MNE  GRC  MKD UNK NA  NA  NA  NA
AD      20      AND     FRA  ESP  NA  NA  NA  NA  NA  NA
AT      40      AUT     CZE  DEU  HUN ITA LIE SVK SVN CHE
BE      56      BEL     FRA  DEU  LUX NLD NA  NA  NA  NA

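3 Answers:

Answer 0 (score: 1)

Here are two ways.

The first is mostly base R, but borrows separate from tidyr (which ships with the tidyverse). For this, I use sapply to split the string in each value of borders, then take the maximum length of those splits. In this case, that is 8 borders. I then use that number to build the column names for separate. I think separate is a handy function, but it can be tricky if you don't know exactly how many columns you will need.

The second way is dplyr-based: I split the strings in borders into a list column, unnest it into a long data frame, create column names from the number of entries for each value of cca2, and use spread to bring it back to wide format.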

library(tidyverse)


# Count the comma-separated pieces in each row and take the maximum (8 here)
max_borders <- max(sapply(as.character(df$borders), function(x) length(strsplit(x, ",")[[1]])))
tidyr::separate(df, borders, into = paste0("b", 1:max_borders), sep = ",")
#> Warning: Expected 8 pieces. Missing pieces filled with `NA` in 3 rows [2,
#> 3, 5].
#> # A tibble: 5 x 11
#>   cca2   ccn3 cca3  b1    b2    b3    b4    b5    b6    b7    b8   
#>   <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AX      248 ALA   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 2 AL        8 ALB   MNE   GRC   MKD   UNK   <NA>  <NA>  <NA>  <NA> 
#> 3 AD       20 AND   FRA   ESP   <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 4 AT       40 AUT   CZE   DEU   HUN   ITA   LIE   SVK   SVN   CHE  
#> 5 BE       56 BEL   FRA   DEU   LUX   NLD   <NA>  <NA>  <NA>  <NA>


df %>%
    mutate(border_list = str_split(borders, ",")) %>%
    unnest(border_list) %>%
    select(-borders) %>%
    group_by(cca2) %>%
    mutate(col = paste0("b", row_number())) %>%
    spread(key = col, value = border_list)
#> # A tibble: 5 x 11
#> # Groups:   cca2 [5]
#>   cca2   ccn3 cca3  b1    b2    b3    b4    b5    b6    b7    b8   
#>   <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AD       20 AND   FRA   ESP   <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 2 AL        8 ALB   MNE   GRC   MKD   UNK   <NA>  <NA>  <NA>  <NA> 
#> 3 AT       40 AUT   CZE   DEU   HUN   ITA   LIE   SVK   SVN   CHE  
#> 4 AX      248 ALA   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 5 BE       56 BEL   FRA   DEU   LUX   NLD   <NA>  <NA>  <NA>  <NA>

Created on 2018-05-08 by the reprex package (v0.2.0).

Answer 1 (score: 1)

Another option is cSplit from the splitstackshape package.
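
For comparison, the same "find the maximum, then pad" idea can be sketched in base R alone. This is an illustration rather than part of the original answer: the toy `countries` data frame below is a three-row subset of the question's table, and `pieces`, `padded` and `border_cols` are names made up for the example. The key point is that `do.call(rbind, ...)` on vectors of unequal length recycles the shorter ones instead of filling with NA, so each vector is padded to a common length first.

```r
# Toy data mirroring three rows of the question's table (assumed for illustration)
countries <- data.frame(
  cca2 = c("AX", "AL", "AD"),
  ccn3 = c(248L, 8L, 20L),
  cca3 = c("ALA", "ALB", "AND"),
  borders = c(NA, "MNE,GRC,MKD,UNK", "FRA,ESP"),
  stringsAsFactors = FALSE
)

# Split each string on commas; an NA input yields a single NA piece
pieces <- strsplit(as.character(countries$borders), ",", fixed = TRUE)
max_borders <- max(lengths(pieces))

# Pad every vector to the same length; `length<-` fills the tail with NA
padded <- lapply(pieces, `length<-`, max_borders)

# Now rbind is safe because all rows have equal length
border_cols <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(border_cols) <- paste0("b", seq_len(max_borders))

newCountries <- cbind(countries[, c("cca2", "ccn3", "cca3")], border_cols)
newCountries
```

Because the padding happens before `rbind`, no values are recycled, and rows with fewer borders end in NA as in the desired output.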

library(splitstackshape)
df <- cSplit(indt = df, splitCols = "borders", sep = ",", direction = "wide")
names(df) <- c(names(df)[1:3], paste0("b", 1:8)) #optional
df
#   cca2 ccn3 cca3   b1   b2   b3   b4   b5   b6   b7   b8
#1:   AX  248  ALA <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#2:   AL    8  ALB  MNE  GRC  MKD  UNK <NA> <NA> <NA> <NA>
#3:   AD   20  AND  FRA  ESP <NA> <NA> <NA> <NA> <NA> <NA>
#4:   AT   40  AUT  CZE  DEU  HUN  ITA  LIE  SVK  SVN  CHE
#5:   BE   56  BEL  FRA  DEU  LUX  NLD <NA> <NA> <NA> <NA>

Data

df <- structure(list(cca2 = structure(c(4L, 2L, 1L, 3L, 5L), .Label = c("AD", 
"AL", "AT", "AX", "BE"), class = "factor"), ccn3 = c(248L, 8L, 
20L, 40L, 56L), cca3 = structure(1:5, .Label = c("ALA", "ALB", 
"AND", "AUT", "BEL"), class = "factor"), borders = structure(c(NA, 
4L, 3L, 1L, 2L), .Label = c("CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE", 
"FRA,DEU,LUX,NLD", "FRA,ESP", "MNE,GRC,MKD,UNK"), class = "factor")), .Names = c("cca2", 
"ccn3", "cca3", "borders"), class = "data.frame", row.names = c(NA, 
-5L))
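
One thing to watch with this data: borders (like cca2 and cca3) is stored as a factor, and strsplit() only accepts character input. cSplit copes with that, but if you want to call strsplit() on it directly, a conversion along these lines may be needed first. The two-row `df` below is a cut-down stand-in for the data above, and `is_fac` is a name made up for the sketch.

```r
# Cut-down stand-in for the data above: `borders` is a factor, as in the dput
df <- data.frame(
  cca2 = factor(c("AX", "AL")),
  borders = factor(c(NA, "MNE,GRC,MKD,UNK"))
)

# strsplit() rejects non-character input, so this call would raise an error:
# strsplit(df$borders, ",")

# Convert every factor column to character first
is_fac <- vapply(df, is.factor, logical(1))
df[is_fac] <- lapply(df[is_fac], as.character)

# Splitting now works; the NA row stays a single NA
strsplit(df$borders, ",", fixed = TRUE)
```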

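Answer 2 (score: 1)

Here is another approach, similar to camille's, but using separate_rows from tidyr. It works like unnest, but for delimited strings, as in this case. This means we can avoid calling str_split and then unnest. We can then create the column names and spread in the same way.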

library(tidyverse)
df <- read_table2(
  "cca2    ccn3    cca3    borders
  AX      248     ALA 
  AL      8       ALB     MNE,GRC,MKD,UNK
  AD      20      AND     FRA,ESP
  AT      40      AUT     CZE,DEU,HUN,ITA,LIE,SVK,SVN,CHE
  BE      56      BEL     FRA,DEU,LUX,NLD"
)

df %>%
  separate_rows(borders, sep = ",") %>%
  group_by(cca2) %>%
  mutate(b = row_number()) %>%
  spread(b, borders, sep = "")
#> # A tibble: 5 x 11
#> # Groups:   cca2 [5]
#>   cca2   ccn3 cca3  b1    b2    b3    b4    b5    b6    b7    b8   
#>   <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 AD       20 AND   FRA   ESP   <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 2 AL        8 ALB   MNE   GRC   MKD   UNK   <NA>  <NA>  <NA>  <NA> 
#> 3 AT       40 AUT   CZE   DEU   HUN   ITA   LIE   SVK   SVN   CHE  
#> 4 AX      248 ALA   <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 5 BE       56 BEL   FRA   DEU   LUX   NLD   <NA>  <NA>  <NA>  <NA>

Created on 2018-05-08 by the reprex package (v0.2.0).
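
A note for newer tidyr versions: since tidyr 1.0.0, spread is superseded by pivot_wider. The spread-based pipelines above should still run, but a rough equivalent, assuming tidyr 1.0.0 or later and dplyr are installed, might look like the sketch below. The three-row `df` is a subset of the question's table and `wide` is a name chosen for the example.

```r
library(dplyr)
library(tidyr)

# Three rows of the question's table (assumed subset for illustration)
df <- tibble(
  cca2 = c("AX", "AL", "AD"),
  ccn3 = c(248L, 8L, 20L),
  cca3 = c("ALA", "ALB", "AND"),
  borders = c(NA, "MNE,GRC,MKD,UNK", "FRA,ESP")
)

wide <- df %>%
  separate_rows(borders, sep = ",") %>%        # one row per border
  group_by(cca2) %>%
  mutate(col = paste0("b", row_number())) %>%  # b1, b2, ... within each country
  ungroup() %>%
  pivot_wider(names_from = col, values_from = borders)

wide
```

Unlike spread, pivot_wider keeps rows in order of first appearance rather than sorting them by the key columns.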