Splitting a column with delimiters and multiple categories in R

Time: 2018-04-16 21:52:15

Tags: r split multiple-columns

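I have a messy table with one column that holds multiple category labels, separated by several different delimiters. I would like R to split that column at each delimiter and create a new column for each category label. The approaches I have seen can only split on one delimiter at a time.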

I want a table that looks like this:

#   ID A B C D          TEXT
# 1  1 a          blue water
# 2  2 a b c     fresh water
# 3  3 a b   d   cold stream
# 4  4   b c d lovely sunset
# 5  5   b c        up there

Here is my attempt so far:

my_table1 <- my_table %>%
  separate(TYPE, c('A', 'B'), ",")
my_table1
# > docs1
#   ID   A        B          TEXT
# 1  1   a     <NA>    blue water
# 2  2   a        b   fresh water
# 3  3 a;b        f   cold stream
# 4  4   f  b and c lovely sunset
# 5  5 b;c     <NA>      up there

my_table2 <- my_table1 %>%
  separate(A, c('A', 'C' ), ";")
# > docs2
#   ID A    C        B          TEXT
# 1  1 a <NA>     <NA>    blue water
# 2  2 a <NA>        b   fresh water
# 3  3 a    b        f   cold stream
# 4  4 f <NA>  b and c lovely sunset
# 5  5 b    c     <NA>      up there

my_table3 <- my_table2 %>%
  separate(A, c('A', 'D'), "and")
# > docs3
#   ID A    D    C        B          TEXT
# 1  1 a <NA> <NA>     <NA>    blue water
# 2  2 a <NA> <NA>        b   fresh water
# 3  3 a <NA>    b        f   cold stream
# 4  4 f <NA> <NA>  b and c lovely sunset
# 5  5 b <NA>    c     <NA>      up there

These attempts get me close, but the column names are off. I also don't want to have to guess where the string "b and c" will end up after several iterations. I have thousands of rows and perhaps five or six categories. My guess is that there is a simpler way to do this.

2 answers:

Answer 0 (score: 2)

As an alternative that extends your tidyverse attempt, here is a solution using strsplit and unnest:

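library(dplyr)
library(tidyr)

df %>%
    # Split TYPE on ';', ',' or the word 'and' in a single pass
    mutate(
        val = strsplit(as.character(TYPE), "(;|,\\s*|\\s*and\\s*)")) %>%
    unnest(val) %>%                # one row per label
    select(-TYPE) %>%
    group_by(ID, TEXT) %>%
    mutate(n = row_number()) %>%   # position of each label within its row
    spread(n, val)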
## A tibble: 5 x 5
## Groups:   ID, TEXT [5]
#     ID TEXT          `1`   `2`   `3`
#  <int> <fct>         <chr> <chr> <chr>
#1     1 blue water    a     NA    NA
#2     2 fresh water   a     b     c
#3     3 cold stream   a     b     f
#4     4 lovely sunset f     b     c
#5     5 up there      b     c     NA

Note that this is not exactly the same as your expected output, but it does match @MKR's output.

Sample data

df <- read.table(text =
    "ID       TYPE          TEXT
1  1          'a'    'blue water'
2  2      'a,b,c'   'fresh water'
3  3      'a;b,f'   'cold stream'
4  4 'f, b and c' 'lovely sunset'
5  5        'b;c'      'up there'")
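
To see what the regular expression in this answer does on its own, here is a minimal base-R sketch. The vector `type` is a hypothetical stand-in for the TYPE column, not part of the original answer.

```r
# Hypothetical stand-in for the TYPE column of the sample data
type <- c("a", "a,b,c", "a;b,f", "f, b and c", "b;c")

# One pattern covers all three delimiters: ';', ',' (plus any following
# spaces), and the word 'and' together with the spaces around it
pieces <- strsplit(type, "(;|,\\s*|\\s*and\\s*)")

pieces[[4]]
# [1] "f" "b" "c"

# Number of labels found in each row
lengths(pieces)
# [1] 1 3 3 3 2
```

One caveat with this pattern: because `\\s*` also matches zero spaces, a label that itself contains "and" would be split as well. That does not matter for the single-letter labels here, but it may for real data.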

Answer 1 (score: 1)

The cSplit function from the splitstackshape package makes this problem easier to solve. One approach could be:

library(splitstackshape)

# First use `gsub` to replace the other delimiters so that only ',' remains.
my_table$TYPE <- gsub("and|;",",",my_table$TYPE)

Mod_df <- cSplit(my_table, "TYPE", sep = ",")

Mod_df
#    ID          TEXT TYPE_1 TYPE_2 TYPE_3
# 1:  1    blue water      a     NA     NA
# 2:  2   fresh water      a      b      c
# 3:  3   cold stream      a      b      f
# 4:  4 lovely sunset      f      b      c
# 5:  5      up there      b      c     NA
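
The `gsub` step leaves stray spaces around some labels before the split, even though the cSplit output above shows clean values. A minimal base-R sketch of the same normalization makes the intermediate strings visible and shows one way to trim them by hand. The vector `type` is a hypothetical stand-in for `my_table$TYPE`.

```r
# Hypothetical stand-in for my_table$TYPE
type <- c("a", "a,b,c", "a;b,f", "f, b and c", "b;c")

# Replace 'and' and ';' with ',' so a single delimiter remains
normalized <- gsub("and|;", ",", type)
normalized[4]
# [1] "f, b , c"

# Split on the literal comma, then trim the leftover whitespace
labels <- lapply(strsplit(normalized, ",", fixed = TRUE), trimws)
labels[[4]]
# [1] "f" "b" "c"
```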

tidyr::gather and tidyr::spread can then be used to get the format the OP mentioned:

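library(dplyr)   # mutate_if, mutate, select, filter
library(tidyr)   # gather, spread

gather(Mod_df, key, value, -ID, -TEXT) %>%
  mutate_if(is.factor, as.character) %>%
  mutate(K = toupper(value)) %>%   # upper-case label becomes the column name
  select(-key) %>%
  filter(!is.na(K)) %>%
  spread(K, value)
#   ID          TEXT    A    B    C    F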
# 1  1    blue water    a <NA> <NA> <NA>
# 2  2   fresh water    a    b    c <NA>
# 3  3   cold stream    a    b <NA>    f
# 4  4 lovely sunset <NA>    b    c    f
# 5  5      up there <NA>    b    c <NA>
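
For comparison, the same one-column-per-category layout can be sketched in base R without any packages. This is only an illustration under stated assumptions, not the method used in this answer: the vectors `id` and `type` are hypothetical stand-ins for the ID and TYPE columns, and the split pattern is borrowed from the other answer.

```r
# Hypothetical stand-ins for the ID and TYPE columns
id   <- 1:5
type <- c("a", "a,b,c", "a;b,f", "f, b and c", "b;c")

# Split every row on ';', ',' or the word 'and'
pieces <- strsplit(type, "(;|,\\s*|\\s*and\\s*)")

# All distinct labels, in sorted order; each becomes a column
categories <- sort(unique(unlist(pieces)))

# For each row, keep the label where it is present and NA where it is not
wide <- do.call(rbind, lapply(pieces, function(p) {
  ifelse(categories %in% p, categories, NA_character_)
}))
colnames(wide) <- toupper(categories)

result <- data.frame(ID = id, wide, stringsAsFactors = FALSE)
result
```

Because the columns are derived from the labels actually found in the data, this scales to more categories without naming them in advance.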

Data

my_table <- read.table(text = 
"  ID       TYPE          TEXT
1  1          a    'blue water'
2  2      'a,b,c'   'fresh water'
3  3      'a;b,f'   'cold stream'
4  4 'f, b and c' 'lovely sunset'
5  5        'b;c'      'up there'",
header = TRUE, stringsAsFactors = FALSE)