I have a nasty data table that is messy in several different ways, and I can't work out how to combine the tidyr and splitstackshape packages to clean it up.
library(data.table)
subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
data.table(cbind(subject, review))
This gives:
   subject                    review
1:       A               Bill: [1.0]
2:       B Bill: [2.0], Cathy: [3.0]
3:       C Fred: [4.0], Cathy: [2.0]
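This is messy in the tidyr sense: multiple variables are stored in a single column, and the formatting is ugly on top of that.
What I want is a table like this: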
subject Bill Fred Cathy
      A  1.0  0.0   0.0
      B  2.0  0.0   3.0
      C  0.0  4.0   2.0
Answer 0 (score: 2)
This should do it. I recommend inspecting the intermediate results to understand the individual steps:
# example setup
library(tidyverse)
subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
dt <- tibble(subject, review)
# solution
dt %>%
  # one row per "name: [value]" entry; the separator also consumes the space after the comma
  separate_rows(review, sep = ",\\s*") %>%
  separate(review, c("name", "interval"), sep = ":\\s*") %>%
  # strip the square brackets and convert to numeric
  mutate(interval = as.numeric(str_replace_all(interval, "\\[|\\]", ""))) %>%
  # add the missing subject/name combinations and set them to 0
  complete(subject, name) %>%
  replace_na(list(interval = 0)) %>%
  spread(name, interval)
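As a possible variation on the pipeline above: `spread()` is superseded in current tidyr by `pivot_wider()`, whose `values_fill` argument can also stand in for the `complete()` plus `replace_na()` steps. The sketch below assumes tidyr 1.1 or later (for a scalar `values_fill`); the names `score` and `wide` are only illustrative. It splits on the delimiter plus any following whitespace so that the reviewer names carry no leading space.

```r
library(dplyr)
library(tidyr)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
dt <- tibble(subject, review)

wide <- dt %>%
  # long form: one row per "name: [score]" entry
  separate_rows(review, sep = ",\\s*") %>%
  # split each entry into the reviewer name and the bracketed score
  separate(review, c("name", "score"), sep = ":\\s*") %>%
  # drop the square brackets and convert the score to numeric
  mutate(score = as.numeric(gsub("\\[|\\]", "", score))) %>%
  # wide form: one column per reviewer, missing combinations filled with 0
  pivot_wider(names_from = name, values_from = score, values_fill = 0)

print(wide)
```

Running the pipeline one step at a time (stopping after each `%>%`) is an easy way to see the long form before it is widened.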
Answer 1 (score: 2)
Here is an option using data.table:
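library(data.table)
# the table has to exist before it is reshaped
dt <- data.table(subject, review)
# split each review into one row per entry, split the entry into name (v1)
# and value (v2), then cast to wide form
dcast(dt[, strsplit(review, ", "), subject][,
      c('v1', 'v2') := tstrsplit(V1, ":\\s+\\[|\\]")],
      subject ~ v1, value.var = 'v2', fill = 0)
#    subject Bill Cathy Fred
# 1:       A  1.0     0    0
# 2:       B  2.0   3.0    0
# 3:       C    0   2.0  4.0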
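In the output above the values are still character strings, which is why "1.0" and "0" appear side by side. If numeric columns are wanted, one option is to let `tstrsplit()` convert the pieces via its `type.convert` argument. The following is only a sketch of that idea, with the intermediate long table kept as a separate object so it can be inspected; the names `long`, `entry`, `name`, `score` and `wide` are illustrative.

```r
library(data.table)

subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")
dt <- data.table(subject, review)

# long form: one row per "name: [score]" entry
long <- dt[, .(entry = unlist(strsplit(review, ", ", fixed = TRUE))), by = subject]

# split each entry into name and score; type.convert = TRUE makes the scores numeric
long[, c("name", "score") := tstrsplit(entry, ":\\s+\\[|\\]", type.convert = TRUE)]

# wide form: one column per reviewer, missing combinations filled with 0
wide <- dcast(long, subject ~ name, value.var = "score", fill = 0)
print(wide)
```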
Answer 2 (score: 1)
The "splitstackshape" approach likewise involves splitting first into a "long" form, then into a "wide" form, and then reshaping the data.
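library(data.table)
library(splitstackshape)
library(magrittr)
# DT is assumed to be the question's data as a data.table
DT <- data.table(subject, review)
DT %>%
  # remove the square brackets (modifies DT by reference)
  .[, review := gsub("\\[|\\]", "", review)] %>%
  # one row per "name: value" entry
  cSplit("review", ",", "long") %>%
  # split each entry into review_1 (name) and review_2 (value)
  cSplit("review", ":", "wide") %>%
  dcast(subject ~ review_1, value.var = "review_2", fill = 0)
##    subject Bill Cathy Fred
## 1:       A    1     0    0
## 2:       B    2     3    0
## 3:       C    0     2    4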
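The same long-then-wide sequence can also be sketched in base R, which may help when checking what each step produces. This is not part of the splitstackshape answer, just an illustration of the idea; the names `entries`, `long`, `wide` and `wide_df` are hypothetical.

```r
subject <- c("A", "B", "C")
review <- c("Bill: [1.0]", "Bill: [2.0], Cathy: [3.0]", "Fred: [4.0], Cathy: [2.0]")

# long form: one "name: [score]" entry per row, with the subject repeated as needed
entries <- strsplit(review, ",\\s*")
long <- data.frame(
  subject = rep(subject, lengths(entries)),
  entry = unlist(entries),
  stringsAsFactors = FALSE
)

# split each entry into the reviewer name and the numeric score
long$name <- sub(":.*$", "", long$entry)
long$score <- as.numeric(gsub("^.*\\[|\\]$", "", long$entry))

# wide form: sum of scores per subject and name; absent pairs come out as 0
wide <- xtabs(score ~ subject + name, data = long)
wide_df <- as.data.frame.matrix(wide)
print(wide_df)
```

Because `xtabs()` sums the scores, a subject reviewed twice by the same person would get the total rather than separate values, which is fine for this data but worth keeping in mind.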
Answer 3 (score: 0)
This might be another way to do it.
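library(data.table)
library(tidyr)
# named dt3 rather than t, which would mask base::t()
dt3 <- data.table(subject, review)
# one row per whitespace-separated token, e.g. "Bill:" and "[1.0]"
tmp <- dt3[, c(text = strsplit(review, " ", fixed = TRUE)), by = subject]
# strip all punctuation except the decimal point
tmp$text <- gsub("[^[:alnum:][:space:].]", "", tmp$text)
# NA for name tokens, a number for score tokens
# (extract_numeric() is deprecated in tidyr in favour of readr::parse_number())
num <- extract_numeric(tmp$text)
# names and scores alternate, so the two subsets line up row by row
tmp2 <- data.frame(subject = tmp$subject[is.na(num)],
                   col2 = tmp$text[is.na(num)],
                   col3 = num[!is.na(num)])
library(reshape2)
m <- dcast(tmp2, subject ~ col2, value.var = "col3")
m[is.na(m)] <- 0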