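Extracting character-list values from data.frame rows and reshaping the data

Time: 2018-04-26 22:13:50

Tags: r string list dataframe reshape

I have a variable x that holds a comma-separated list of characters in each row: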

dat <- data.frame(id = c(rep('a',2),rep('b',2),'c'),
                  x = c('f,o','f,o,o','b,a,a,r','b,a,r','b,a'),
                  stringsAsFactors = FALSE)

I would like to reshape the data so that each row is a unique (id, x) pair, for example:

dat2 <- data.frame(id = c(rep('a',2),rep('b',3),rep('c',2)), 
                   x = c('f','o','a','b','r','a','b'))

> dat2
  id x
1  a f
2  a o
3  b a
4  b b
5  b r
6  c a
7  c b

I tried to do this by splitting the character lists and keeping only the unique values in each row:

dat$x <- sapply(strsplit(dat$x, ','), sort)
dat$x <- sapply(dat$x, unique)
dat <- unique(dat)

> dat
  id       x
1  a    f, o
3  b a, b, r
5  c    a, b

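However, I don't know how to go on from here and turn the per-row lists into separate rows.

How can I do this? Is there a more efficient way to convert the lists of strings and reshape the data as described above?

5 answers:

Answer 0 (score: 4)

You can use tidytext::unnest_tokens: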

library(tidytext)
library(dplyr)

dat %>% 
  unnest_tokens(x1, x) %>% 
  distinct()

  id x1
1  a  f
2  a  o
3  b  b
4  b  a
5  b  r
6  c  b
7  c  a

Answer 1 (score: 2)

A base R approach in two lines:

#get list of X potential vars
x <- strsplit(dat$x, ",")
# construct full data.frame, then use unique to return desired rows
unique(data.frame(id=rep(dat$id, lengths(x)), x=unlist(x)))

This returns

   id x
1   a f
2   a o
6   b b
7   b a
9   b r
13  c b
14  c a

If you don't want to write out the variable names yourself, you can use setNames:

setNames(unique(data.frame(rep(dat$id, lengths(x)), unlist(x))), names(dat))
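
The result above keeps the values in order of first appearance, whereas the dat2 shown in the question lists x alphabetically within each id. Below is a minimal sketch of one way to get that ordering in base R, assuming the dat from the question. The names parts and long are illustrative and not part of the original answer.

```r
dat <- data.frame(id = c(rep("a", 2), rep("b", 2), "c"),
                  x = c("f,o", "f,o,o", "b,a,a,r", "b,a,r", "b,a"),
                  stringsAsFactors = FALSE)

# Split each string on literal commas: one character vector per row
parts <- strsplit(dat$x, ",", fixed = TRUE)

# Repeat each id once per extracted value, then drop duplicate (id, x) pairs
long <- unique(data.frame(id = rep(dat$id, lengths(parts)),
                          x = unlist(parts),
                          stringsAsFactors = FALSE))

# Sort by id, then by x, and reset the row names to 1..n
long <- long[order(long$id, long$x), ]
rownames(long) <- NULL
long
```

Resetting the row names is optional; it only makes the printed result line up with the 1 to 7 numbering of dat2.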

Answer 2 (score: 2)

We can use separate_rows:

library(tidyverse)
dat %>%
  separate_rows(x) %>%
  distinct()
#  id x
#1  a f
#2  a o
#3  b b
#4  b a
#5  b r
#6  c b
#7  c a
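
By default, separate_rows() splits on any run of characters that are neither alphanumeric nor a period, which is why the call above works without naming the comma. The sketch below, assuming tidyr and dplyr are installed, passes the column and the separator explicitly so that other punctuation inside a value would be left alone. The name res is illustrative.

```r
library(tidyr)
library(dplyr)

dat <- data.frame(id = c(rep("a", 2), rep("b", 2), "c"),
                  x = c("f,o", "f,o,o", "b,a,a,r", "b,a,r", "b,a"),
                  stringsAsFactors = FALSE)

res <- dat %>%
  separate_rows(x, sep = ",") %>%  # one row per comma-separated value
  distinct()                       # drop repeated (id, x) pairs
res
```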

Answer 3 (score: 1)

One solution is to split the x column into multiple columns with splitstackshape::cSplit; gather and filter then reshape the result into the desired output.

library(tidyverse)
library(splitstackshape)

dat %>% cSplit("x", sep=",") %>%
  mutate_if(is.factor, as.character) %>%
  gather(key, value, -id) %>%
  filter(!is.na(value)) %>%
  select(-key) %>% unique()


#     id value
# 1   a     f
# 3   b     b
# 5   c     b
# 6   a     o
# 8   b     a
# 10  c     a
# 13  b     r

Answer 4 (score: 1)

A base solution:

temp <- do.call(rbind, apply( dat, 1, 
     function(z){ data.frame(
                    id=z[1], 
                    x = scan(text=z['x'], what="",sep=","),
                    stringsAsFactors=FALSE)} ) )
Read 2 items
Read 3 items
Read 4 items
Read 3 items
Read 2 items
Warning messages:
1: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
  row names were found from a short variable and have been discarded
2: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
  row names were found from a short variable and have been discarded
3: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
  row names were found from a short variable and have been discarded
4: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
  row names were found from a short variable and have been discarded
5: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
  row names were found from a short variable and have been discarded

 temp[!duplicated(temp),]
 #------
   id x
1   a f
2   a o
6   b b
7   b a
9   b r
13  c b
14  c a

To get rid of all the messages and warnings:

 temp <- do.call(rbind, apply( dat, 1, 
     function(z){ suppressWarnings(data.frame(id=z[1], 
         x = scan(text=z['x'], what="",sep=",", quiet=TRUE), stringsAsFactors=FALSE)
                )} ) )
 temp[!duplicated(temp),]
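
As an alternative to suppressing the warning, the sketch below tries to avoid triggering it at all. It assumes the warning comes from z[1] carrying the name "id" into data.frame(), where it is treated as a row name that is too short. Extracting the value with z[["id"]] drops that name, and quiet = TRUE silences scan(). The name pieces is illustrative and not part of the original answer.

```r
dat <- data.frame(id = c(rep("a", 2), rep("b", 2), "c"),
                  x = c("f,o", "f,o,o", "b,a,a,r", "b,a,r", "b,a"),
                  stringsAsFactors = FALSE)

# For each row, build a small data.frame with one line per comma-separated value.
# z[["id"]] returns the id without its "id" name, so data.frame() finds no
# row names to discard; quiet = TRUE stops scan() printing "Read n items".
pieces <- apply(dat, 1, function(z) {
  data.frame(id = z[["id"]],
             x = scan(text = z[["x"]], what = "", sep = ",", quiet = TRUE),
             stringsAsFactors = FALSE)
})

# Stack the per-row pieces and keep only the first occurrence of each pair
temp <- do.call(rbind, pieces)
temp[!duplicated(temp), ]
```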