Question

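I am trying to reshape the following dataset with reshape(), without much success.

The starting dataset is in "wide" form, with each id described by a single row. The dataset is intended for multistate analysis (a generalization of survival analysis).

Each person is recorded over a given overall time span. During this period the subject can go through a number of transitions between states (for simplicity, let us fix the maximum number of distinct states that can be visited at two). The first visited state is s1 = 1, 2, 3, 4. The person stays in that state for dur1 periods, and the same applies to the second visited state s2: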

   id    cohort    s1     dur1     s2     dur2     
     1      1        3      4       2      5       
     2      0        1      4       4      3

The long-format dataset I would like to obtain is:

id    cohort    s    
1       1       3
1       1       3
1       1       3
1       1       3
1       1       2
1       1       2
1       1       2
1       1       2
1       1       2
2       0       1
2       0       1
2       0       1
2       0       1
2       0       4
2       0       4
2       0       4

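In effect, each id has dur1 + dur2 rows, and s1 and s2 are merged into a single variable s.

How would you go about this transformation? Also, how would you get back to the original "wide" dataset?

Many thanks!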

dat <- cbind(id=c(1,2), cohort=c(1, 0), s1=c(3, 1), dur1=c(4, 4), s2=c(2, 4), dur2=c(5, 3))

Answer 1

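You can use reshape() for the first step, but you then need to do some more work. Also, reshape() needs a data.frame as its input, but your sample data is a matrix.

Here's how to proceed:

reshape() your data from wide to long: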

dat2 <- reshape(data.frame(dat), direction = "long", 
                idvar = c("id", "cohort"),
                varying = 3:ncol(dat), sep = "")
dat2
#       id cohort time s dur
# 1.1.1  1      1    1 3   4
# 2.0.1  2      0    1 1   4
# 1.1.2  1      1    2 2   5
# 2.0.2  2      0    2 4   3
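
To make explicit what sep = "" is doing, the same reshape can be sketched with the column pairs spelled out by hand. This assumes the column names from the question; wide_df and long_df are names introduced here purely for illustration.

```r
# Sample data from the question, built directly as a data.frame
wide_df <- data.frame(id = c(1, 2), cohort = c(1, 0),
                      s1 = c(3, 1), dur1 = c(4, 4),
                      s2 = c(2, 4), dur2 = c(5, 3))

# Say explicitly which wide columns stack into which long column,
# instead of letting sep = "" split "s1" into "s" and 1
long_df <- reshape(wide_df, direction = "long",
                   idvar = c("id", "cohort"),
                   varying = list(c("s1", "s2"), c("dur1", "dur2")),
                   v.names = c("s", "dur"),
                   timevar = "time", times = 1:2)
long_df
```

The explicit form is more verbose, but it may be easier to adapt when the column names do not follow a regular name-plus-number pattern.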

"Expand" the resulting data.frame using rep():

dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), c("id", "cohort", "s")]
dat3[order(dat3$id), ]
#         id cohort s
# 1.1.1    1      1 3
# 1.1.1.1  1      1 3
# 1.1.1.2  1      1 3
# 1.1.1.3  1      1 3
# 1.1.2    1      1 2
# 1.1.2.1  1      1 2
# 1.1.2.2  1      1 2
# 1.1.2.3  1      1 2
# 1.1.2.4  1      1 2
# 2.0.1    2      0 1
# 2.0.1.1  2      0 1
# 2.0.1.2  2      0 1
# 2.0.1.3  2      0 1
# 2.0.2    2      0 4
# 2.0.2.1  2      0 4
# 2.0.2.2  2      0 4

You can also get rid of the funky row names with rownames(dat3) <- NULL.
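
The expansion step relies on rep() accepting one repeat count per element. A small toy sketch of that behaviour follows; toy, idx and expanded are made-up names and the data is not from the question.

```r
# rep() with a vector of counts repeats each element that many times;
# a count of 0 drops the element entirely
idx <- rep(seq_len(3), times = c(2, 0, 1))
idx

# Using the repeated row indices to expand a data.frame
toy <- data.frame(s = c("a", "b", "c"), dur = c(2, 0, 1))
expanded <- toy[idx, , drop = FALSE]
rownames(expanded) <- NULL   # replace the "1", "1.1", "3" style row names
expanded
```

Indexing with repeated row numbers is what produces row names such as 1.1.1.1 in the output above, which is why resetting them afterwards is convenient.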

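Update: Retaining the ability to revert to the original form

In the example above, since we dropped the "time" and "dur" variables, it isn't possible to revert directly to the original dataset. If this is something you need to do, I suggest keeping those columns and, when required, creating another data.frame with just the subset of columns you need.

Here's how:

Use aggregate() to get back to "dat2":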

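# dat3 must still contain "time" and "dur" here, so rebuild it without dropping columns
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), ]
aggregate(cbind(s, dur) ~ ., dat3, unique)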
#   id cohort time s dur
# 1  2      0    1 1   4
# 2  1      1    1 3   4
# 3  2      0    2 4   3
# 4  1      1    2 2   5

Wrap reshape() around that to get back to the wide form of the original data ("dat"). Here it is in one step:

reshape(aggregate(cbind(s, dur) ~ ., dat3, unique), 
        direction = "wide", idvar = c("id", "cohort"))
#   id cohort s.1 dur.1 s.2 dur.2
# 1  2      0   1     4   4     3
# 2  1      1   3     4   2     5
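
As a rough check that the round trip loses nothing, the whole sequence can be sketched end to end and compared with the original matrix. This assumes dat3 keeps every column; back and same are names introduced here for illustration.

```r
dat <- cbind(id = c(1, 2), cohort = c(1, 0), s1 = c(3, 1),
             dur1 = c(4, 4), s2 = c(2, 4), dur2 = c(5, 3))

# Wide -> long, then expand while keeping every column
dat2 <- reshape(data.frame(dat), direction = "long",
                idvar = c("id", "cohort"),
                varying = 3:ncol(dat), sep = "")
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), ]

# Expanded -> long -> wide
back <- reshape(aggregate(cbind(s, dur) ~ ., dat3, unique),
                direction = "wide", idvar = c("id", "cohort"))

# Restore the original row order and column names ("s.1" -> "s1")
back <- back[order(back$id), ]
names(back) <- sub(".", "", names(back), fixed = TRUE)

# Compare values only; attributes such as row names differ
same <- isTRUE(all.equal(as.matrix(back[, colnames(dat)]), dat,
                         check.attributes = FALSE))
same
```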

Answer 2

There are probably better ways, but this might work.

df <- read.table(text = '
   id    cohort    s1     dur1     s2     dur2     
     1      1        3      4       2      5       
     2      0        1      4       4      3',
header=TRUE)

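# One row per id, one column per period; 9 is the longest total duration
# (4 + 5 for id 1), and 0 marks unused padding periods
hist <- matrix(0, nrow=nrow(df), ncol=9)
hist

# Fill each row with state 1 repeated dur1 times, state 2 repeated dur2 times, then padding
for(i in 1:nrow(df)) {
  hist[i,] <- c(rep(df[i,3], df[i,4]), rep(df[i,5], df[i,6]), rep(0, (9 - df[i,4] - df[i,6])))
}

hist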

hist2 <- cbind(df[,1:2], hist)
colnames(hist2) <- c('id', 'cohort', paste('x', seq_along(1:9), sep=''))

library(reshape2)

hist3 <- melt(hist2, id.vars=c('id', 'cohort'), variable.name='x', value.name='state')

hist4 <- hist3[order(hist3$id, hist3$cohort),]
hist4

# Drop the period index column
hist4 <- hist4[ , !names(hist4) %in% c("x")]

# Drop the zero-padding rows (state 0 is not a real state)
hist4 <- hist4[hist4$state != 0, ]

This gives:

   id cohort state
1   1      1     3
3   1      1     3
5   1      1     3
7   1      1     3
9   1      1     2
11  1      1     2
13  1      1     2
15  1      1     2
17  1      1     2
2   2      0     1
4   2      0     1
6   2      0     1
8   2      0     1
10  2      0     4
12  2      0     4
14  2      0     4

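Of course, if you have more than two states per id, this would have to be modified (and it might have to be modified if you have more than two cohorts). For example, I suppose that with 9 sample periods one person could be in the following sequence of states: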

1 3 2 4 3 4 1 1 2
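
One possible way to handle more than two states is to pass the state and duration columns in as arguments and expand each row with rep(). This is only a sketch: expand_states is a hypothetical helper, not part of any package, and it assumes the data has id and cohort columns as in the question.

```r
# Hypothetical helper: expand any number of (state, duration) column pairs
expand_states <- function(df, state_cols, dur_cols) {
  pieces <- lapply(seq_len(nrow(df)), function(i) {
    states <- unname(unlist(df[i, state_cols]))
    durs   <- unname(unlist(df[i, dur_cols]))
    s <- rep(states, times = durs)   # each state repeated for its duration
    data.frame(id = rep(df$id[i], length(s)),
               cohort = rep(df$cohort[i], length(s)),
               s = s)
  })
  do.call(rbind, pieces)
}

# Sample data from the question
df <- data.frame(id = c(1, 2), cohort = c(1, 0),
                 s1 = c(3, 1), dur1 = c(4, 4),
                 s2 = c(2, 4), dur2 = c(5, 3))
long <- expand_states(df, c("s1", "s2"), c("dur1", "dur2"))
long
```

Because each row is expanded independently, no zero padding or fixed number of periods is needed; a third pair would be added by extending the two column-name vectors.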

Reshaping from wide to long and vice versa (multistate/survival analysis dataset)
