Question

Please note that I have read a similar question about splitting a column of a data frame into multiple columns, but my case is different.

My data looks like this:

  name    description      
1 a       hello|hello again|something 
2 b       hello again|something|hello
3 c       hello again|hello
4 d

I would like to split the description column as follows:

  name    description_1 description_2 description_3
1 a       hello         hello again   something 
2 b       hello         hello again   something 
3 c       hello         hello again   N/A
4 d       N/A           N/A           N/A

Any suggestions or pointers?

Edit: Following the answers from @akrun and @Sotos (thanks!), here is a more accurate representation of my data:

  name    description      
1 a       add words|change|approximate 
2 b       control|access|approximate
4 d

Sorting the data alphabetically would therefore give:

  name    description_1    description_2 description_3
1 a       add words        approximate   change 
2 b       access           approximate   control 
4 d       N/A              N/A           N/A

Whereas what I need is:

  name    desc_1      desc_2       desc_3   desc_4   desc_5
1 a       add words   approximate  change   N/A      N/A
2 b       N/A         approximate  N/A      control  access 
4 d       N/A         N/A          N/A      N/A      N/A

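I don't mind how the descriptions are ordered (if at all), as long as each column (desc_1..5) always holds the same description. I hope this clarifies my question.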

Answer 1

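We can use match to reorder each row's entries according to the order of the entries in the first row of description, and then split with cSplit from the splitstackshape package: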

library(splitstackshape)
#make sure column 'description' is a character
df$description <- as.character(df$description)

ind <- strsplit(df$description, '\\|')[[1]]
df$description <- sapply(strsplit(df$description, '\\|'), function(i) 
                                         paste(i[order(match(i, ind))], collapse = '|'))

cSplit(df, 'description', sep = '|', 'wide')
#   name description_1 description_2 description_3
#1:    a         hello   hello_again     something
#2:    b         hello   hello_again     something
#3:    c         hello   hello_again            NA
#4:    d            NA            NA            NA
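To see what the reordering step does on its own, here is a minimal base R sketch. The data frame `df` is rebuilt from the sample in the question, which is an assumption about how the data is actually stored; no extra packages are needed.

```r
# Sample data reconstructed from the question (assumed layout)
df <- data.frame(
  name = c("a", "b", "c", "d"),
  description = c("hello|hello again|something",
                  "hello again|something|hello",
                  "hello again|hello",
                  ""),
  stringsAsFactors = FALSE
)

parts <- strsplit(df$description, "|", fixed = TRUE)
ind <- parts[[1]]  # reference order: the entries of the first row

# match() gives each term's position in 'ind'; order() sorts by that position
reordered <- vapply(
  parts,
  function(i) paste(i[order(match(i, ind))], collapse = "|"),
  character(1)
)
print(reordered)
```

Terms that do not appear in the first row get NA from match(), and order() places those last, so this approach relies on the first row listing every term.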

Answer 2

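We can do this in base R: split the 'description' column on '|' into a list (note: if 'description' is of class factor, use strsplit(as.character(df1$description), ...)), sort each element, pad with NA at the end those list elements that are shorter than the longest one, and cbind the result with the first column of 'df1'.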

lst <- lapply(strsplit(df1$description, "|", fixed = TRUE), sort)
d1 <- setNames(do.call(rbind.data.frame, lapply(lst, `length<-` 
                 ,max(lengths(lst)))), paste0("description_", 1:3))
cbind(df1[1], d1)
#   name description_1 description_2 description_3
#1    a         hello   hello again     something
#2    b         hello   hello again     something
#3    c         hello   hello again          <NA>
#4    d          <NA>          <NA>          <NA>
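The padding step is the least obvious part of the code above, so here is a small sketch of it in isolation. The list `lst` is typed in by hand to mirror the sorted split of the sample data (an assumption for illustration), and the number of columns is taken from the data rather than hard-coded.

```r
# Hand-written stand-in for the sorted, split descriptions (illustrative only)
lst <- list(
  c("hello", "hello again", "something"),
  c("hello", "hello again"),
  character(0)
)

n <- max(lengths(lst))

# `length<-` extends each vector to length n, filling the tail with NA
padded <- lapply(lst, `length<-`, n)

# Bind the equal-length vectors into a character matrix, one row per element
m <- do.call(rbind, padded)
colnames(m) <- paste0("description_", seq_len(n))
print(m)
```

Using seq_len(n) instead of 1:3 keeps the column names in step with the data if a row ever has more than three terms.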

Edit: Based on @thelatemail's comment

We can also create a factor and specify the levels:

lvls <- sort(unique(unlist(lst)))
lst <- lapply(lst, function(x) x[order(factor(x, levels = lvls))])

Then use the same code as above to build 'd1'.

Another option is to use cSplit to split the 'description' column and reshape it to 'long' format, then sort, dcast it to 'wide', and join with the original dataset on 'name':

library(splitstackshape)
dcast(cSplit(df1, "description", "|", "long")[, sort(description) , by = name], 
  name ~  paste0("description_", rowid(name)), value.var = "V1")[df1[-2], on = "name"]
#   name description_1 description_2 description_3
#1:    a         hello   hello again     something
#2:    b         hello   hello again     something
#3:    c         hello   hello again            NA
#4:    d            NA            NA            NA

Alternatively, in the hadleyverse, use separate_rows/spread:

library(tidyr)
library(dplyr)
separate_rows(df1, description, sep="[|]") %>%
         arrange(name, description) %>% 
         group_by(name) %>% 
         mutate(Seq = paste0("description_", row_number()) ) %>% 
         spread(Seq, description)
#  name description_1 description_2 description_3
#  <chr>         <chr>         <chr>         <chr>
#1     a         hello   hello again     something
#2     b         hello   hello again     something
#3     c         hello   hello again          <NA>
#4     d                        <NA>          <NA>

Update

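For the new data in the OP's post, the intended ordering is not clear. However, as the OP mentioned, the order doesn't matter; what matters is the number of 'description' columns, with each column holding the same term: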

lst <- strsplit(df2$description, "|", fixed = TRUE)
lvls <- sort(unique(unlist(lst)))
d1 <- setNames(do.call(rbind.data.frame, lapply(lst, function(x)  
      ifelse(lvls %in% x, lvls, NA))), paste0("description_", 1:5))
cbind(df2[1], d1) 
# name description_1 description_2 description_3 description_4 description_5
#1    a          <NA>     add words   approximate        change          <NA>
#2    b        access          <NA>   approximate          <NA>       control
#4    d          <NA>          <NA>          <NA>          <NA>          <NA>
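A self-contained variant of the same idea is sketched below. The data frame `df2` is reconstructed from the edited question (an assumption about the real data), and the number of columns is derived from the distinct terms instead of being fixed at 5.

```r
# Sample data reconstructed from the edited question (assumed layout)
df2 <- data.frame(
  name = c("a", "b", "d"),
  description = c("add words|change|approximate",
                  "control|access|approximate",
                  ""),
  stringsAsFactors = FALSE
)

lst <- strsplit(df2$description, "|", fixed = TRUE)
lvls <- sort(unique(unlist(lst)))  # one output column per distinct term

# For each row, keep a term in its own column if present, otherwise NA
mat <- t(vapply(
  lst,
  function(x) ifelse(lvls %in% x, lvls, NA_character_),
  character(length(lvls))
))
colnames(mat) <- paste0("description_", seq_along(lvls))

res <- cbind(df2["name"], as.data.frame(mat, stringsAsFactors = FALSE))
print(res)
```

Because every term owns a fixed column, rows with different subsets of terms still line up, which is the property the OP asked for.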

Answer 3

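You probably know what format you want, but there are two different representations worth considering. The first is a 'long' data frame listing each 'name' with its associated terms: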

terms = strsplit(as.character(df$description), "|", fixed=TRUE)
data.frame(
    name = rep(df$name, lengths(terms)),
    term = unlist(terms))

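The second is an 'incidence matrix' whose rows and columns correspond to names and terms, with TRUE indicating that a particular term occurs in a particular row: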

term = unlist(terms)
m = matrix(
    FALSE, nrow(df), length(unique(term)),
    dimnames=list(df$name, unique(term)))
idx = cbind(    # a two-column matrix can be used as an 'index' into another matrix
    rep(as.character(df$name), lengths(terms)),
    term)
m[idx] = TRUE

For example (it is easier to answer questions when a simple example is provided in a form that can be cut and pasted into an R session)

df = data.frame(
    name=c("a", "b", "c"),
    description=c(
        "add words|change|approximate",
        "control|access|approximate",
        ""))

we have

>     data.frame(
+         name = rep(df$name, lengths(terms)),
+         term = unlist(terms))
  name        term
1    a   add words
2    a      change
3    a approximate
4    b     control
5    b      access
6    b approximate

and

> m
  add words change approximate control access
a      TRUE   TRUE        TRUE   FALSE  FALSE
b     FALSE  FALSE        TRUE    TRUE   TRUE
c     FALSE  FALSE       FALSE   FALSE  FALSE

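The 'long' data frame is appropriate for sparse data (many terms, each occurring in only a few rows), the matrix representation for denser data. If desired, m can be bound to the original data frame with cbind(df, m).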
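The two representations can also be converted into one another. As a sketch that is not part of the original answer, the 'long' data frame can be turned into an incidence matrix with base R's table(); the names `long` and `m2` are chosen here for illustration.

```r
# Same example data as above
df <- data.frame(
  name = c("a", "b", "c"),
  description = c("add words|change|approximate",
                  "control|access|approximate",
                  ""),
  stringsAsFactors = FALSE
)
terms <- strsplit(df$description, "|", fixed = TRUE)

# 'long' form; the factor levels keep names that have no terms (here "c")
long <- data.frame(
  name = factor(rep(df$name, lengths(terms)), levels = df$name),
  term = unlist(terms)
)

# Cross-tabulate name by term and turn the counts into TRUE/FALSE
m2 <- unclass(table(long$name, long$term)) > 0
print(m2)
```

Unlike m above, the columns of m2 come out in alphabetical order of the terms rather than in order of first appearance.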
