Split a concatenated column and fill the corresponding columns with values

Time: 2018-03-08 21:33:47

Tags: r tidyr splitstackshape

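I have a nasty data table with several different kinds of messiness, and I can't figure out how to clean it up using the tidyr and splitstackshape packages together.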

library(data.table)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
data.table(cbind(subject, review))

This gives:

   subject                    review
1:       A               Bill: [1.0]
2:       B Bill: [2.0], Cathy: [3.0]
3:       C Fred: [4.0], Cathy: [2.0]

This is the kind of messiness tidyr is meant for: multiple variables stored in one column, plus some ugly formatting.

What I want is a table like this:

subject  Bill  Fred  Cathy
A        1.0   0.0   0.0
B        2.0   0.0   3.0
C        0.0   4.0   2.0

4 answers:

Answer 0 (score: 2)

This should do it. I recommend inspecting the intermediate results to understand the individual steps:

# example setup
library(tidyverse)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
dt <- tibble(subject, review)

# solution
dt %>% 
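  # consume the whitespace after "," and ":" so names such as " Cathy" carry no leading space
  separate_rows(review, sep = ",\\s*") %>%
  separate(review, c("name", "interval"), sep = ":\\s*") %>%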
  mutate(interval = as.numeric(str_replace_all(interval, "\\[|\\]", ""))) %>%
  complete(subject, name) %>%
  replace_na(list(interval = 0)) %>%
  spread(name, interval)
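
To make the intermediate results easier to inspect, the pipeline can be split at the long form. The following is only a sketch, assuming tidyr 1.0.0 or later, where `pivot_wider()` is available as the successor to `spread()`; the names `long` and `wide` are introduced here for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
dt <- tibble(subject, review)

# Long form: one row per (subject, name) pair, with a numeric score
long <- dt %>%
  separate_rows(review, sep = ",\\s*") %>%
  separate(review, c("name", "interval"), sep = ":\\s*") %>%
  mutate(interval = as.numeric(gsub("\\[|\\]", "", interval)))

# Wide form: pivot_wider() fills the missing (subject, name) pairs with 0,
# so complete() and replace_na() are not needed in this variant
wide <- long %>%
  pivot_wider(names_from = name, values_from = interval,
              values_fill = list(interval = 0))
```

Printing `long` before `wide` shows what each reviewer contributed for each subject before any zeros are filled in.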

Answer 1 (score: 2)

Here is an option using data.table:

library(data.table)
dcast(dt[, strsplit(review, ", "),  subject][, 
    c('v1', 'v2') := tstrsplit(V1, ":\\s+\\[|\\]")],
       subject ~ v1, value.var = 'v2', fill = 0)
#   subject Bill Cathy Fred
#1:       A  1.0     0    0
#2:       B  2.0   3.0    0
#3:       C    0   2.0  4.0

Data

dt <- data.table (subject, review) 
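
Note that the output above mixes "1.0" and "0" because the split values are still character strings. A possible step-by-step variant that converts them to numeric before casting is sketched below; the intermediate names `long`, `pair`, `name`, and `value` are assumptions made for illustration, not part of the original answer.

```r
library(data.table)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
dt <- data.table(subject, review)

# Step 1: one row per "name: [value]" pair within each subject
long <- dt[, .(pair = unlist(strsplit(review, ", ", fixed = TRUE))), by = subject]

# Step 2: split each pair on ": [" or "]" into a name and a value
long[, c("name", "value") := tstrsplit(pair, ":\\s+\\[|\\]")]

# Step 3: convert the value to numeric so the filled zeros have the same type
long[, value := as.numeric(value)]

# Step 4: cast to wide form, filling absent combinations with 0
wide <- dcast(long, subject ~ name, value.var = "value", fill = 0)
```

Keeping `long` as a separate object makes it easy to check the split before reshaping.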

Answer 2 (score: 1)

The "splitstackshape" approach likewise requires first splitting the data into a "long" form, then into a "wide" form, and then reshaping it.

library(data.table)
library(splitstackshape)
library(magrittr)

# DT is the question's data as a data.table
DT <- data.table(subject, review)

DT %>% 
  .[, review := gsub("\\[|\\]", "", review)] %>% 
  cSplit("review", ",", "long") %>% 
  cSplit("review", ":", "wide") %>% 
  dcast(subject ~ review_1, value.var = "review_2", fill = 0)
##    subject Bill Cathy Fred
## 1:       A    1     0    0
## 2:       B    2     3    0
## 3:       C    0     2    4

Answer 3 (score: 0)

This may be another way.

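library(data.table)
library(reshape2)

# named dat rather than t, which would mask base::t()
dat <- data.table(subject, review)
# split each review on spaces: names and values end up on alternating rows
tmp <- dat[, .(text = unlist(strsplit(review, " ", fixed = TRUE))), by = subject]
# strip everything except letters, digits, whitespace, and the decimal point
tmp$text <- gsub("[^[:alnum:][:space:].]", "", tmp$text)

# tidyr::extract_numeric() is deprecated; as.numeric() returns NA for the names
num <- suppressWarnings(as.numeric(tmp$text))
tmp2 <- data.frame(subject = tmp$subject[is.na(num)],
                   col2 = tmp$text[is.na(num)],
                   col3 = num[!is.na(num)])
# fill = 0 replaces the separate m[is.na(m)] <- 0 step
m <- reshape2::dcast(tmp2, subject ~ col2, value.var = "col3", fill = 0)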