这是一个普遍公认的事实,R的基本重塑命令速度快,功能强大但语法难以理解。因此我编写了一个快速包装器,我将把它放到taRifx
包的下一个版本中。然而,在我这样做之前,我想要求改进。
这是我的版本,来自@RichieCotton的更新:
# reshapeasy: Version of reshape with way, way better syntax
# Written with the help of the StackOverflow R community
# x is a data.frame to be reshaped
# direction is "wide" or "long"
# vars are the names of the (stubs of) the variables to be reshaped (if omitted, defaults to everything not in id or vary)
# id are the names of the variables that identify unique observations
# vary is the variable that varies. Going to wide this variable will cease to exist. Going to long it will be created.
# omit is a vector of characters which are to be omitted if found at the end of variable names (e.g. price_1 becomes price in long)
# ... are options to be passed to stats::reshape
reshapeasy <- function( data, direction, id=(sapply(data,is.factor) | sapply(data,is.character)), vary=sapply(data,is.numeric), omit=c("_","."), vars=NULL, ... ) {
if(direction=="wide") data <- stats::reshape( data=data, direction=direction, idvar=id, timevar=vary, ... )
if(direction=="long") {
varying <- which(!(colnames(data) %in% id))
data <- stats::reshape( data=data, direction=direction, idvar=id, varying=varying, timevar=vary, ... )
}
colnames(data) <- gsub( paste("[",paste(omit,collapse="",sep=""),"]$",sep=""), "", colnames(data) )
return(data)
}
请注意,您可以在不更改方向以外的选项的情况下从宽到长移动。对我而言,这是可用性的关键。
如果您通过聊天或通过电子邮件将您的信息发送给我,我很乐意在功能帮助文件中确认是否有任何重大改进。
改进可能属于以下几个方面:
实施例
示例数据
x.wide <- structure(list(surveyNum = 1:6, pio_1 = structure(c(2L, 2L, 1L,
2L, 1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2"), class = "factor"), pio_2 = structure(c(2L, 1L, 2L, 1L,
2L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2"), class = "factor"), pio_3 = structure(c(2L, 2L, 1L, 1L,
2L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2"), class = "factor"), caremgmt_1 = structure(c(2L, 1L, 1L,
2L, 1L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2"), class = "factor"), caremgmt_2 = structure(c(1L, 2L, 2L,
2L, 2L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2"), class = "factor"), caremgmt_3 = structure(c(1L, 2L, 1L,
2L, 1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2"), class = "factor"), prev_1 = structure(c(1L, 2L, 2L, 1L,
1L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2"), class = "factor"), prev_2 = structure(c(2L, 2L, 1L, 2L,
1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2"), class = "factor"), prev_3 = structure(c(2L, 1L, 2L, 2L,
1L, 1L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2"), class = "factor"), price_1 = structure(c(2L, 1L, 2L, 5L,
3L, 4L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2", "3", "4", "5", "6"), class = "factor"), price_2 = structure(c(6L,
5L, 5L, 4L, 4L, 2L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2", "3", "4", "5", "6"), class = "factor"), price_3 = structure(c(3L,
5L, 2L, 5L, 4L, 5L), .Names = c("1", "2", "3", "4", "5", "6"), .Label = c("1",
"2", "3", "4", "5", "6"), class = "factor")), .Names = c("surveyNum",
"pio_1", "pio_2", "pio_3", "caremgmt_1", "caremgmt_2", "caremgmt_3",
"prev_1", "prev_2", "prev_3", "price_1", "price_2", "price_3"
), idvars = "surveyNum", rdimnames = list(structure(list(surveyNum = 1:24), .Names = "surveyNum", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24"
), class = "data.frame"), structure(list(variable = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("pio",
"caremgmt", "prev", "price"), class = "factor"), .id = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L)), .Names = c("variable",
".id"), row.names = c("pio_1", "pio_2", "pio_3", "caremgmt_1",
"caremgmt_2", "caremgmt_3", "prev_1", "prev_2", "prev_3", "price_1",
"price_2", "price_3"), class = "data.frame")), row.names = c(NA,
6L), class = c("cast_df", "data.frame"))
x.long <- structure(list(.id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), pio = structure(c(2L,
2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L,
1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L,
1L, 2L, 2L, 1L, 2L, 1L, 1L), .Label = c("1", "2"), class = "factor"),
caremgmt = structure(c(2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L,
1L, 2L, 2L), .Label = c("1", "2"), class = "factor"), prev = structure(c(1L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L,
2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L), .Label = c("1",
"2"), class = "factor"), price = structure(c(2L, 1L, 2L,
5L, 3L, 4L, 1L, 5L, 4L, 3L, 1L, 2L, 6L, 6L, 5L, 4L, 6L, 3L,
5L, 6L, 3L, 1L, 2L, 4L, 3L, 5L, 2L, 5L, 4L, 5L, 6L, 6L, 4L,
6L, 4L, 1L, 2L, 3L, 1L, 2L, 2L, 5L, 1L, 6L, 1L, 3L, 4L, 3L,
6L, 5L, 5L, 4L, 4L, 2L, 2L, 2L, 6L, 3L, 1L, 4L, 4L, 5L, 1L,
3L, 6L, 1L, 3L, 5L, 1L, 3L, 6L, 2L), .Label = c("1", "2",
"3", "4", "5", "6"), class = "factor"), surveyNum = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L,
16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L)), .Names = c(".id",
"pio", "caremgmt", "prev", "price", "surveyNum"), row.names = c(NA,
-72L), class = "data.frame")
实施例
> x.wide
surveyNum pio_1 pio_2 pio_3 caremgmt_1 caremgmt_2 caremgmt_3 prev_1 prev_2 prev_3 price_1 price_2 price_3
1 1 2 2 2 2 1 1 1 2 2 2 6 3
2 2 2 1 2 1 2 2 2 2 1 1 5 5
3 3 1 2 1 1 2 1 2 1 2 2 5 2
4 4 2 1 1 2 2 2 1 2 2 5 4 5
5 5 1 2 2 1 2 1 1 1 1 3 4 4
6 6 1 2 1 2 1 1 2 1 1 4 2 5
> reshapeasy( x.wide, "long", NULL, id="surveyNum", vary="id", sep="_" )
surveyNum id pio caremgmt prev price
1.1 1 1 2 2 1 2
2.1 2 1 2 1 2 1
3.1 3 1 1 1 2 2
4.1 4 1 2 2 1 5
5.1 5 1 1 1 1 3
6.1 6 1 1 2 2 4
1.2 1 2 2 1 2 6
2.2 2 2 1 2 2 5
3.2 3 2 2 2 1 5
4.2 4 2 1 2 2 4
5.2 5 2 2 2 1 4
6.2 6 2 2 1 1 2
1.3 1 3 2 1 2 3
2.3 2 3 2 2 1 5
3.3 3 3 1 1 2 2
4.3 4 3 1 2 2 5
5.3 5 3 2 1 1 4
6.3 6 3 1 1 1 5
> head(x.long)
.id pio caremgmt prev price surveyNum
1 1 2 2 1 2 1
2 1 2 1 2 1 2
3 1 1 1 2 2 3
4 1 2 2 1 5 4
5 1 1 1 1 3 5
6 1 1 2 2 4 6
> head(reshapeasy( x.long, direction="wide", id="surveyNum", vary=".id" ))
surveyNum pio.1 caremgmt.1 prev.1 price.1 pio.3 caremgmt.3 prev.3 price.3 pio.2 caremgmt.2 prev.2 price.2
1 1 2 2 1 2 2 1 2 3 2 1 2 6
2 2 2 1 2 1 2 2 1 5 1 2 2 5
3 3 1 1 2 2 1 1 2 2 2 2 1 5
4 4 2 2 1 5 1 2 2 5 1 2 2 4
5 5 1 1 1 3 2 1 1 4 2 2 1 4
6 6 1 2 2 4 1 1 1 5 2 1 1 2
答案 0 :(得分:3)
我还想看一个订购输出的选项,因为这是我不喜欢在基础R中重塑的一个例子。例如,让我们使用你已经使用的Stata Learning Module: Reshaping data wide to long熟悉。我正在看的例子是“1岁和2岁时的孩子身高和体重”的例子。
以下是我通常使用reshape()
执行的操作:
# library(foreign)
kidshtwt = read.dta("http://www.ats.ucla.edu/stat/stata/modules/kidshtwt.dta")
kidshtwt.l = reshape(kidshtwt, direction="long", idvar=1:2,
varying=3:6, sep="", timevar="age")
# The reshaped data is correct, just not in the order I want it
# so I always have to do another step like this
kidshtwt.l = kidshtwt.l[order(kidshtwt.l$famid, kidshtwt.l$birth),]
由于这是一个令人讨厌的步骤,在重塑数据时我总是要经历这个步骤,我认为将它添加到您的函数中会很有用。
我还建议您至少可以选择使用最终列顺序执行相同的操作,以便从long
重新定位到wide
。
我不确定将此功能集成到您的函数中的最佳方法,但我将它们放在一起,根据变量名称的基本模式对数据框进行排序。
col.name.sort = function(data, patterns) {
a = names(data)
b = length(patterns)
subs = vector("list", b)
for (i in 1:b) {
subs[[i]] = sort(grep(patterns[i], a, value=T))
}
x = unlist(subs)
data[ , x ]
}
可以按以下方式使用。想象一下,我们已将reshapeasy
long
到wide
示例的输出保存为名为a
的数据框,我们希望按“surveyNum”,“caremgmt”排序( 1-3),“prev”(1-3),“pio”(1-3)和“price”(1-3),我们可以使用:
col.name.sort(a, c("sur", "car", "pre", "pio", "pri"))
答案 1 :(得分:2)
一些初步想法:
我一直认为方向命令“宽”和“长”有点模糊。它们是否意味着您要将数据转换为该格式,或者数据是否已经采用该格式?这是你需要学习或查找的东西。您可以通过分离函数reshapeToWide
和reshapeToLong
来避免此问题。作为奖励,每个函数的签名只有一个参数。
我认为你不打算包括这行
varying <- which(!(colnames(x.wide) %in% "surveyNum"))
因为它引用了特定的数据集。
对于第一个参数,我更喜欢data
到x
,因为它清楚地表明输入应该是数据框。
通常更好的形式是首先没有默认值。因此vars
应该在id
和vary
之后。
您可以选择id
和vary
的默认值吗? reshape::melt
默认为因变量的id和数字列的因子和字符列。
答案 2 :(得分:2)
我认为你的例子中可能存在错误。为了从宽到长,我收到以下错误:
> reshapeasy( x.wide, "long", NULL, id="surveyNum", vary="id", sep="_" )
Error in gsub(paste("[", paste(omit, collapse = "", sep = ""), "]$", sep = ""), :
invalid regular expression '[]$', reason 'Missing ']''
删除NULL
可以解决问题。这让我想问一下,NULL
的目的是什么?
我还认为,如果默认情况下生成time
变量,如果没有用户明确指定(如在reshape()
中所做的那样),该功能将得到改进。
例如,请参阅基座reshpae()
中的以下内容:
> head(reshape(x.wide, direction="long", idvar=1, varying=2:13, sep="_"))
surveyNum time pio caremgmt prev price
1.1 1 1 2 2 1 2
2.1 2 1 2 1 2 1
3.1 3 1 1 1 2 2
4.1 4 1 2 2 1 5
5.1 5 1 1 1 1 3
6.1 6 1 1 2 2 4
如果我对此很熟悉,并且我发现你的功能对我来说是“变化的”,我可能会试着尝试:
> head(reshapeasy( x.wide, "long", id="surveyNum", sep="_" ))
Error in `row.names<-.data.frame`(`*tmp*`, value = paste(d[, idvar], times[1L], :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘1.1’
但这不是一个非常有用的错误。也许包括自定义错误消息可能对您的最终功能有用。
允许用户设置与NULL
不同,正如您在当前版本的功能中所做的那样,对我来说似乎也不明智。这会产生如下输出:
> head(reshapeasy( x.wide, "long", id="surveyNum", NULL, sep="_" ))
surveyNum pio caremgmt prev price
1.1 1 2 2 1 2
2.1 2 2 1 2 1
3.1 3 1 1 2 2
4.1 4 2 2 1 5
5.1 5 1 1 1 3
6.1 6 1 2 2 4
这个输出的问题是,如果我需要重新变形,我不能轻易做到。因此,我认为保留reshape生成time
变量的默认选项,但让用户覆盖它可能是一个有用的功能。
答案 3 :(得分:2)
也许对于那些懒惰且不喜欢输入变量名称的人,可以将以下内容添加到函数的头部:
if (is.numeric(id) == 1) {
id = colnames(data)[id]
} else if (is.numeric(id) == 0) {
id = id
}
if (is.numeric(vary) == 1) {
vary = colnames(data)[vary]
} else if (is.numeric(vary) == 0) {
vary = vary
}
然后,按照您的示例,您可以使用以下简写:
reshapeasy(x.wide, direction="long", id=1, sep="_", vary="id")
reshapeasy(x.long, direction="wide", id=6, vary=1)
(我知道,这可能不是一个好习惯,因为以后某些人的可读性可能性较差,或者不易理解,但确实经常发生。)