Reshape a data frame from long to wide with irregular row lengths

Time: 2013-06-14 01:14:37

Tags: r dataframe reshape

I have a data frame like the following:

mydf <- read.csv(text="num,placed,recovered
1,2013-02-22 12:14:00,2013-02-27 15:14:00
1,2013-03-03 17:32:00,2013-03-07 17:32:00
1,2013-04-24 10:13:00,2013-04-26 07:47:00
1,2013-04-15 14:51:00,2013-04-19 09:36:00
1,2013-04-11 11:56:00,2013-04-15 12:52:00
10,2013-02-22 07:30:00,2013-02-27 14:55:00
10,2013-03-03 17:20:00,2013-03-07 17:20:00
10,2013-04-15 15:22:00,2013-04-19 09:48:00
10,2013-02-17 10:38:00,2013-02-22 07:18:00
10,2013-04-11 10:09:00,2013-04-15 13:21:00
10,2013-04-24 10:07:00,2013-04-26 08:23:00
11,2013-02-22 14:23:00,2013-02-27 15:50:00
11,2013-04-11 12:51:00,2013-04-14 09:40:00
11,2013-04-15 14:45:00,2013-04-19 08:28:00
11,2013-04-19 10:13:00,2013-04-23 12:01:00
14,2013-03-01 13:45:00,2013-03-08 14:28:00
14,2013-02-22 13:22:00,2013-02-27 15:24:00
14,2013-04-04 15:36:00,2013-04-17 15:04:00",header=TRUE)

I would like to rearrange it so that each value of num appears once, with all of its placed and recovered values on a single row. Here is an example row:

num           placed1          recovered1             placed2          recovered2             placed3          recovered3             placed4          recovered4             placed5          recovered5
1 2013-02-22 12:14:00 2013-02-27 15:14:00 2013-03-03 17:32:00 2013-03-07 17:32:00 2013-04-24 10:13:00 2013-04-26 07:47:00 2013-04-15 14:51:00 2013-04-19 09:36:00 2013-04-11 11:56:00 2013-04-15 12:52:00

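Some rows will have different numbers of placed and recovered values. It is fine for NA to appear in those places. I have tried using the reshape function, but can't seem to get what I want.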

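I am doing this in order to subset another dataset that I am cleaning. That other dataset records measurements over time, along with the time each one was collected. The device that took the data is stored in the num column. I want to take a subset of that data frame, keeping only the times when the device was deployed (the time between each placed/recovered pair). The other data frame looks like this: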

num  temp time
1    5    2013-02-22 12:13:50
1    6    2013-02-22 12:14:00
1    4    2013-02-22 12:14:10
1    9    2013-04-24 09:45:20
1    7    2013-04-24 11:45:50
10   23   2013-03-03 19:23:40

If I were able to subset it successfully, the result would look like the following:

num  temp time
1    6    2013-02-22 12:14:00
1    4    2013-02-22 12:14:10
1    7    2013-04-24 11:45:50
10   23   2013-03-03 19:23:40

1 Answer:

Answer 0: (score: 2)

You just need to add a "time" variable to your dataset, numbering the rows within each num, and reshape will work fine:

mydf$time <- with(mydf, ave(num, num, FUN = seq_along))
head(mydf)
#   num              placed           recovered time
# 1   1 2013-02-22 12:14:00 2013-02-27 15:14:00    1
# 2   1 2013-03-03 17:32:00 2013-03-07 17:32:00    2
# 3   1 2013-04-24 10:13:00 2013-04-26 07:47:00    3
# 4   1 2013-04-15 14:51:00 2013-04-19 09:36:00    4
# 5   1 2013-04-11 11:56:00 2013-04-15 12:52:00    5
# 6  10 2013-02-22 07:30:00 2013-02-27 14:55:00    1
reshape(mydf, idvar="num", timevar="time", direction = "wide")
#    num            placed.1         recovered.1            placed.2         recovered.2
# 1    1 2013-02-22 12:14:00 2013-02-27 15:14:00 2013-03-03 17:32:00 2013-03-07 17:32:00
# 6   10 2013-02-22 07:30:00 2013-02-27 14:55:00 2013-03-03 17:20:00 2013-03-07 17:20:00
# 12  11 2013-02-22 14:23:00 2013-02-27 15:50:00 2013-04-11 12:51:00 2013-04-14 09:40:00
# 16  14 2013-03-01 13:45:00 2013-03-08 14:28:00 2013-02-22 13:22:00 2013-02-27 15:24:00
#               placed.3         recovered.3            placed.4         recovered.4
# 1  2013-04-24 10:13:00 2013-04-26 07:47:00 2013-04-15 14:51:00 2013-04-19 09:36:00
# 6  2013-04-15 15:22:00 2013-04-19 09:48:00 2013-02-17 10:38:00 2013-02-22 07:18:00
# 12 2013-04-15 14:45:00 2013-04-19 08:28:00 2013-04-19 10:13:00 2013-04-23 12:01:00
# 16 2013-04-04 15:36:00 2013-04-17 15:04:00                <NA>                <NA>
#               placed.5         recovered.5            placed.6         recovered.6
# 1  2013-04-11 11:56:00 2013-04-15 12:52:00                <NA>                <NA>
# 6  2013-04-11 10:09:00 2013-04-15 13:21:00 2013-04-24 10:07:00 2013-04-26 08:23:00
# 12                <NA>                <NA>                <NA>                <NA>
# 16                <NA>                <NA>                <NA>                <NA>
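
The question asked for column names like placed1 rather than placed.1. As a small sketch, base reshape() takes a sep argument that is used when building the wide column names. The toy data frame below (toy) is made up for illustration and is not part of the original data.

```r
# Toy data (hypothetical, for illustration only): two devices with unequal row counts
toy <- data.frame(
  num       = c(1, 1, 2),
  placed    = c("a1", "a2", "b1"),
  recovered = c("x1", "x2", "y1"),
  stringsAsFactors = FALSE
)

# Running counter within each num, as in the answer above
toy$time <- with(toy, ave(num, num, FUN = seq_along))

# sep = "" drops the "." between the variable name and the time index
wide <- reshape(toy, idvar = "num", timevar = "time",
                direction = "wide", sep = "")
names(wide)
```

Device 2 has only one row in the toy data, so its placed2 and recovered2 cells come out as NA, just as in the full output above.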

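If you have added the "time" variable as I did above, you can also use the "reshape2" package after first making the dataset even longer. This extra-long dataset (which I call "mydf.l" below) may be more useful for subsetting than the wide dataset: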

library(reshape2)
mydf.l <- melt(mydf, id.vars=c("num", "time"))
head(mydf.l)
#   num time variable               value
# 1   1    1   placed 2013-02-22 12:14:00
# 2   1    2   placed 2013-03-03 17:32:00
# 3   1    3   placed 2013-04-24 10:13:00
# 4   1    4   placed 2013-04-15 14:51:00
# 5   1    5   placed 2013-04-11 11:56:00
# 6  10    1   placed 2013-02-22 07:30:00
dcast(mydf.l, num ~ variable + time)
#   num            placed_1            placed_2            placed_3            placed_4
# 1   1 2013-02-22 12:14:00 2013-03-03 17:32:00 2013-04-24 10:13:00 2013-04-15 14:51:00
# 2  10 2013-02-22 07:30:00 2013-03-03 17:20:00 2013-04-15 15:22:00 2013-02-17 10:38:00
# 3  11 2013-02-22 14:23:00 2013-04-11 12:51:00 2013-04-15 14:45:00 2013-04-19 10:13:00
# 4  14 2013-03-01 13:45:00 2013-02-22 13:22:00 2013-04-04 15:36:00                <NA>
#              placed_5            placed_6         recovered_1         recovered_2
# 1 2013-04-11 11:56:00                <NA> 2013-02-27 15:14:00 2013-03-07 17:32:00
# 2 2013-04-11 10:09:00 2013-04-24 10:07:00 2013-02-27 14:55:00 2013-03-07 17:20:00
# 3                <NA>                <NA> 2013-02-27 15:50:00 2013-04-14 09:40:00
# 4                <NA>                <NA> 2013-03-08 14:28:00 2013-02-27 15:24:00
#           recovered_3         recovered_4         recovered_5         recovered_6
# 1 2013-04-26 07:47:00 2013-04-19 09:36:00 2013-04-15 12:52:00                <NA>
# 2 2013-04-19 09:48:00 2013-02-22 07:18:00 2013-04-15 13:21:00 2013-04-26 08:23:00
# 3 2013-04-19 08:28:00 2013-04-23 12:01:00                <NA>                <NA>
# 4 2013-04-17 15:04:00                <NA>                <NA>                <NA>
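
For the subsetting goal described in the question, reshaping may not be needed at all. Here is a minimal sketch, assuming the measurement data frame is called readings (a hypothetical name) and has the num, temp and time columns shown in the question. It merges the readings with the original long interval data by num, then keeps the readings whose time falls inside a placed/recovered pair. Only a few interval rows are copied here to keep the example short, and tz = "UTC" is an assumption.

```r
# Interval data: a few rows copied from the question
intervals <- read.csv(text = "num,placed,recovered
1,2013-02-22 12:14:00,2013-02-27 15:14:00
1,2013-04-24 10:13:00,2013-04-26 07:47:00
10,2013-03-03 17:20:00,2013-03-07 17:20:00", stringsAsFactors = FALSE)

# Measurement data from the question; "readings" is a hypothetical name
readings <- read.csv(text = "num,temp,time
1,5,2013-02-22 12:13:50
1,6,2013-02-22 12:14:00
1,4,2013-02-22 12:14:10
1,9,2013-04-24 09:45:20
1,7,2013-04-24 11:45:50
10,23,2013-03-03 19:23:40", stringsAsFactors = FALSE)

# Parse the timestamps; the time zone is an assumption, adjust to your data
to_time <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
intervals$placed    <- to_time(intervals$placed)
intervals$recovered <- to_time(intervals$recovered)
readings$time       <- to_time(readings$time)

# Pair every reading with every interval of the same device, then keep
# the readings that fall within an interval (endpoints included)
both <- merge(readings, intervals, by = "num")
kept <- both[both$time >= both$placed & both$time <= both$recovered,
             c("num", "temp", "time")]
kept <- kept[order(kept$num, kept$time), ]
rownames(kept) <- NULL
kept
```

On this sample the rows kept are the four shown in the question's desired result. If a device's intervals can overlap, a reading may match more than one interval and appear twice, so wrapping the result in unique() would be a reasonable safeguard. The merge also creates one row per reading/interval pair, which could get large on big datasets.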