I am trying to rearrange data in R from the following layout:
Patient ID,Episode Number,Admission Date (A),Admission Date (H),Admission Time (A),Admission Time (H)
1,5,20/08/2011,21/08/2011,1200,1300
2,6,21/08/2011,22/08/2011,1300,1400
3,7,22/08/2011,23/08/2011,1400,1500
4,8,23/08/2011,24/08/2011,1500,1600
into something like:
Record Type,Patient ID,Episode Number,Admission Date,Admission Time
H,1,5,20/08/2011,1200
A,1,5,21/08/2011,1300
H,2,6,21/08/2011,1300
A,2,6,22/08/2011,1400
H,3,7,22/08/2011,1400
A,3,7,23/08/2011,1500
H,4,8,23/08/2011,1500
A,4,8,24/08/2011,1600
(I have used CSV format here so that the samples are easier to use as test data.)
I tried the reshape() function, and it sort of works:
> reshape(foo, direction = "long", idvar = 1, varying = 3:dim(foo)[2],
+         sep = "..", timevar = "dataset")
Patient.ID Episode.Number dataset Admission.Date Admission.Time
1.A. 1 5 A. 20/08/2011 1200
2.A. 2 6 A. 21/08/2011 1300
3.A. 3 7 A. 22/08/2011 1400
4.A. 4 8 A. 23/08/2011 1500
1.H. 1 5 H. 21/08/2011 1300
2.H. 2 6 H. 22/08/2011 1400
3.H. 3 7 H. 23/08/2011 1500
4.H. 4 8 H. 24/08/2011 1600
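But that is not exactly the format I want (for each "Patient ID", I want the first row to be "H" and the second row to be "A").
Also, when I scale this up to the real data (which has 250+ columns), it fails: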
> reshape(realdata, direction = "long", idvar = 1, varying =
+         6:dim(foo)[2], sep = "..", timevar = "dataset")
Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, :
'varying' arguments must be the same length
I think this is partly because the column names look like this:
> colnames(foo)
[1] "Unique.Key"
[2] "Campus.Code"
[3] "UR"
[4] "Terminal.digit"
[5] "Admission.date..A."
[6] "Admission.date..H."
[7] "Admission.time..A."
[8] "Admission.time..H."
.
.
.
[31] "Medicare.Number"
[32] "Payor"
[33] "Doctor.specialty"
[34] "Clinic"
.
.
.
[202] "Admission.Source..A."
[203] "Admission.Source..H."
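That is, there are "common columns" (with no suffix) sitting in between the columns that do have a suffix (I hope that makes sense).
Answer 0: (score: 1)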
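The suggestions to use melt and cast (now dcast and family) from the "reshape" (now "reshape2") package won't get your data into the form you are looking for. In particular, you'll need to do some additional processing if your end goal is the "semi-long" format you describe.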
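There are two problems in the question you've posed:
The first is the ordering of the results. As @RichieCotton points out in his comment and @mac in his answer, a call to order() is enough to fix that.
The second is the error: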
Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying, :
'varying' arguments must be the same length
This is because, as you guessed, the selection varying = 6:dim(foo)[2] includes columns that do not vary.
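To make the cause concrete, here is a minimal sketch that reproduces the failure on a tiny data frame. The demo data frame and its values are made up for illustration; only the column-naming pattern is taken from the question.

```r
# Hypothetical miniature with one non-varying column (Payor) sitting between
# the suffixed columns, as in the real data.
demo <- data.frame(Unique.Key = 1:2,
                   Admission.Date..A = 11:12,
                   Admission.Date..H = 21:22,
                   Payor = c("p", "q"),
                   Admission.Source..A = c(0.1, 0.2),
                   Admission.Source..H = c(0.3, 0.4))

# A catch-all range sweeps Payor into 'varying', so reshape() cannot pair
# every variable with both an A and an H column.
res <- tryCatch(
  reshape(demo, direction = "long", idvar = "Unique.Key",
          varying = 2:ncol(demo), sep = ".."),
  error = function(e) e
)

# Print the error message if the call failed.
if (inherits(res, "error")) print(conditionMessage(res))
```

Dropping Payor from the varying range (for example varying = c(2, 3, 5, 6)) should let the same call go through.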
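A simple way to fix this is to use grep to identify which columns vary and use that result to specify the columns, rather than the (incorrect) catch-all range you used. Here is a working example: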
set.seed(1)
foo <- data.frame(Unique.Key = 1:4, Campus.Code = LETTERS[1:4],
Admission.Date..A = 11:14, Admission.Date..H = 21:24,
Medicare.Number = letters[1:4], Payor = letters[1:4],
Admission.Source..A = rnorm(4),
Admission.Source..H = rnorm(4))
foo
# Unique.Key Campus.Code Admission.Date..A Admission.Date..H Medicare.Number
# 1 1 A 11 21 a
# 2 2 B 12 22 b
# 3 3 C 13 23 c
# 4 4 D 14 24 d
# Payor Admission.Source..A Admission.Source..H
# 1 a -0.6264538 0.3295078
# 2 b 0.1836433 -0.8204684
# 3 c -0.8356286 0.4874291
# 4 d 1.5952808 0.7383247
Find out which columns vary and pass them as the varying argument:
varyingCols <- grep("\\.\\.A$|\\.\\.H$", names(foo))
out <- reshape(foo, direction = "long", idvar = "Unique.Key",
varying = varyingCols, sep = "..")
out[order(out$Unique.Key, rev(out$time)), ]
# Unique.Key Campus.Code Medicare.Number Payor time Admission.Date Admission.Source
# 1.H 1 A a a H 21 0.3295078
# 1.A 1 A a a A 11 -0.6264538
# 2.H 2 B b b H 22 -0.8204684
# 2.A 2 B b b A 12 0.1836433
# 3.H 3 C c c H 23 0.4874291
# 3.A 3 C c c A 13 -0.8356286
# 4.H 4 D d d H 24 0.7383247
# 4.A 4 D d d A 14 1.5952808
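Note that rev(out$time) gives the right order here because reshape() stacks all the A rows before all the H rows in blocks of equal size. If you would rather not depend on that layout, one possible sketch is to sort on a factor whose levels spell out the order you want. The demo data frame and the rec_order name below are illustrative, not from the question.

```r
# Hypothetical miniature of the wide data; column names follow the example above.
demo <- data.frame(Unique.Key = 1:3,
                   Campus.Code = c("X", "Y", "Z"),
                   Admission.Date..A = 11:13,
                   Admission.Date..H = 21:23,
                   stringsAsFactors = FALSE)

long <- reshape(demo, direction = "long", idvar = "Unique.Key",
                varying = grep("\\.\\.(A|H)$", names(demo)), sep = "..")

# Turn the record type into a factor whose levels give the desired order
# (H first, then A), and sort by key and then by that factor.
rec_order <- factor(long$time, levels = c("H", "A"))
long <- long[order(long$Unique.Key, rec_order), ]
long
```

The same idea extends to more than two record types: list them in levels in whatever order you need.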
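If your data are small (not many columns), you can work out the positions of the varying columns by hand and pass them as a vector. As you have already noticed, any column not listed in idvar or varying gets recycled appropriately.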
out <- reshape(foo, direction = "long", idvar = "Unique.Key",
varying = c(3, 4, 7, 8), sep = "..")
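If guessing the variable names from sep turns out to be fragile for your real column names, reshape() also accepts varying as a list, together with explicit v.names and times. The following is only a sketch on a made-up demo data frame; the Record.Type name for timevar is my choice, borrowed from the layout you asked for.

```r
# Hypothetical miniature of the wide data.
demo <- data.frame(Unique.Key = 1:2,
                   Campus.Code = c("X", "Y"),
                   Admission.Date..A = c(11, 12),
                   Admission.Date..H = c(21, 22),
                   Payor = c("p", "q"),
                   Admission.Source..A = c(0.1, 0.2),
                   Admission.Source..H = c(0.3, 0.4),
                   stringsAsFactors = FALSE)

# Each list element names the wide columns that collapse into one long column;
# within every element the columns are listed in the same order as 'times'.
out2 <- reshape(demo, direction = "long", idvar = "Unique.Key",
                varying = list(c("Admission.Date..A", "Admission.Date..H"),
                               c("Admission.Source..A", "Admission.Source..H")),
                v.names = c("Admission.Date", "Admission.Source"),
                times = c("A", "H"),
                timevar = "Record.Type")
out2
```

This is more typing than the grep approach, so it is probably only worth it when the suffixes are inconsistent.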
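Answer 1: (score: 0)
You can probably get what you are after with melt and cast, or with reshape, but you are looking for something very specific, so it may be simpler to do the reshaping directly. You can subset the original data into two separate data frames (one for A, one for H) and then glue them back together.
The code below works on your example data, but I have also tried to write it flexibly so that it will work on a larger data set, as long as the columns are consistently named with the "..A." and "..H." suffixes.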
#grab the common columns and the "A" columns
#(by using grepl to find any column that doesn't end in ".H.")
foo.a <- foo[,!grepl(x=colnames(foo),pattern = "\\.H\\.$")]
#strip the "..A." from the end of the ".A." column names
colnames(foo.a) <- sub(x=colnames(foo.a),
pattern="(.*)\\.\\.A\\.$",
replacement = "\\1")
foo.a$Record.Type <- "A"
#grab the common columns and the "H" columns
#(by using grepl to find any column that doesn't end in ".A.")
foo.h <- foo[,!grepl(x=colnames(foo),pattern = "\\.A\\.$")]
#strip the "..H." from the end of the "..H." column names
colnames(foo.h) <- sub(x=colnames(foo.h),
pattern="(.*)\\.\\.H\\.$",
replacement = "\\1")
foo.h$Record.Type <- "H"
#stick them back together
new.foo <- rbind(foo.a,foo.h)
#order by Patient.ID
new.foo <- new.foo[with(new.foo,order(Patient.ID)),]
#re-order the columns as you like
new.foo <- new.foo[,c(1,2,5,3,4)]
This gives me:
> new.foo
Patient.ID Episode.Number Record.Type Admission.Date Admission.Time
1 1 5 A 20/08/2011 1200
5 1 5 H 21/08/2011 1300
2 2 6 A 21/08/2011 1300
6 2 6 H 22/08/2011 1400
3 3 7 A 22/08/2011 1400
7 3 7 H 23/08/2011 1500
4 4 8 A 23/08/2011 1500
8 4 8 H 24/08/2011 1600
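The result above lists the A row before the H row for each patient, whereas the question asks for H first. Here is a possible sketch of the same split-and-stack idea with that ordering, using the first two rows of the question's sample data. The one_type helper is hypothetical (it is not part of any package), and it assumes the trailing-dot names that read.csv() produces from the original headers.

```r
# Sample data from the question; read.csv() turns "Patient ID" into Patient.ID
# and "Admission Date (A)" into Admission.Date..A. (note the trailing dot).
foo <- read.csv(text = "Patient ID,Episode Number,Admission Date (A),Admission Date (H),Admission Time (A),Admission Time (H)
1,5,20/08/2011,21/08/2011,1200,1300
2,6,21/08/2011,22/08/2011,1300,1400", stringsAsFactors = FALSE)

# Hypothetical helper: keep the common columns plus those for one record type,
# then strip the "..A." / "..H." suffix from the column names.
one_type <- function(df, type) {
  other <- setdiff(c("A", "H"), type)
  keep <- !grepl(paste0("\\.", other, "\\.$"), names(df))
  res <- df[, keep, drop = FALSE]
  names(res) <- sub(paste0("\\.\\.", type, "\\.$"), "", names(res))
  res$Record.Type <- type
  res
}

new.foo <- rbind(one_type(foo, "H"), one_type(foo, "A"))

# Sort by patient, then by record type with H placed before A.
rec_order <- factor(new.foo$Record.Type, levels = c("H", "A"))
new.foo <- new.foo[order(new.foo$Patient.ID, rec_order), ]
new.foo[, c("Record.Type", "Patient.ID", "Episode.Number",
            "Admission.Date", "Admission.Time")]
```

Selecting the columns by name in the last line avoids having to recount positions when the real data have many more columns.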