This is the table I need to convert to wide format:
V1 V2 V3 V4
1 A0 numeric string
1 A1 . .
1 A2 . .
1 A3 . .
1 A4 . .
1 A5 . .
1 A6 . .
1 A7 . .
2 A0 . .
2 A1 . .
... ... . .
I have been trying something like this:
reshape(variable.name, timevar = "V2", idvar = "V1", direction = "wide")
This produces the following, which seems to be what I want:
V1 V3.A0 V4.A0 V3.A1 ...
1 Numeric String Numeric ...
2 ... ... ... ...
But I get a warning message:
Warning message:
In reshapeWide(data, idvar = idvar, timevar = timevar, varying = varying, :
multiple rows match for V2 = blah: first taken
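Why does this warning occur, and how can I get around it? I don't want to ignore it, because I have to do the same thing for several data files. Thanks! I really appreciate your help.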
Answer 0: (score: 3)
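As some have pointed out, you need to decide what you want to do with the extra values. dcast lets you specify an aggregation function. It is otherwise basically the same as reshape with direction = "wide", except that you can say what to do when a cell has multiple values. Here is an example where basically every combination has duplicates, and we display the full vector for each combination as a deparsed string (for example, the values 1 and 2 are shown as c("1", "2"), because melt has coerced them to character).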
library(reshape2)
# Make up data
df <- data.frame(
  V1=rep(1:3, 14),
  V2=rep(paste0("A", 0:6), 6),
  V3=sample(1:100, 42),
  V4=paste0(sample(letters, 42, replace=TRUE), sample(letters, 42, replace=TRUE))
)
# Need to melt V3 and V4 together first because
# dcast does not allow multiple value variables;
# unfortunately, this also coerces the V3 values to character
df.melt <- melt(df, id.vars=c("V1", "V2"))
# Function to handle multiple items for one V1 - V2
# pair. In this case we just deparse the vectors,
# but if you wanted, you could convert the numerics
# back to integers, or do whatever you want (e.g.
# paste if character, median if numeric).
my_func <- function(x) {
  paste0(deparse(x), collapse="")
}
# Now convert to wide format with dcast
dcast(
  df.melt,
  V1 ~ V2 + variable,
  value.var="value",
  fun.aggregate=my_func
)
This produces the following:
V1 A0_V3 A0_V4 A1_V3 A1_V4
1 1 c("86", "93") c("yf", "pr") c("5", "76") c("py", "aj")
2 2 c("53", "71") c("as", "mi") c("42", "12") c("ho", "la")
3 3 c("69", "16") c("lm", "un") c("66", "100") c("xk", "px")
A2_V3 A2_V4 A3_V3 A3_V4 A4_V3
1 c("43", "67") c("xh", "bk") c("79", "94") c("ix", "cx") c("51", "50")
2 c("14", "68") c("nq", "sr") c("25", "19") c("dw", "ay") c("28", "35")
3 c("21", "24") c("wu", "il") c("39", "88") c("vz", "yw") c("74", "65")
A4_V4 A5_V3 A5_V4 A6_V3 A6_V4
1 c("hv", "uw") c("85", "34") c("cn", "ql") c("73", "87") c("px", "vy")
2 c("qb", "dc") c("2", "72") c("ci", "du") c("81", "49") c("sd", "rx")
3 c("jk", "fv") c("6", "90") c("sr", "yr") c("62", "97") c("rg", "dv")
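The comments in the code above suggest that the aggregation function could do something more useful than deparsing, such as taking the median of numeric values and pasting character values. The following is a minimal sketch of that idea, not a definitive implementation. The name smart_func, the "/" separator, and the small data frame are assumptions made up for illustration. It relies on the fact that melt has already turned every value into a character string, so the function tries to convert back to numeric first.

```r
library(reshape2)

# Made-up data: each V1/V2 pair appears twice
df <- data.frame(
  V1 = c(1, 1, 2, 2),
  V2 = "A0",
  V3 = c(10, 20, 30, 50),
  V4 = c("ab", "cd", "ef", "gh"),
  stringsAsFactors = FALSE
)
df.melt <- melt(df, id.vars = c("V1", "V2"))

# Hypothetical aggregation function: always returns one character string
smart_func <- function(x) {
  num <- suppressWarnings(as.numeric(x))
  if (length(x) > 0 && !anyNA(num)) {
    as.character(median(num))   # every value parses as a number: median
  } else {
    paste(x, collapse = "/")    # otherwise: paste the strings together
  }
}

res <- dcast(df.melt, V1 ~ V2 + variable, value.var = "value",
             fun.aggregate = smart_func)
res
```

Note that the result columns are still character, because the function has to return the same type for every cell.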
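The perfect solution would be a combination of reshape and dcast. Unfortunately, dcast (AFAIK) does not allow multiple Z (value) columns, whereas reshape does (hence the melt step and the coercion to character), and reshape does not allow aggregation functions (AFAIK).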
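You can work around this by running dcast twice, once with V3 and once with V4, and then merging the results, or by adding more intelligence to the aggregation function.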
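Running dcast once per value column and merging the results could look roughly like the sketch below. The data frame, the choice of median for V3 and paste for V4, and the "V3."/"V4." column prefixes are assumptions for illustration. Because each column is cast separately, V3 stays numeric and no melt step is needed.

```r
library(reshape2)

# Made-up data: every V1/V2 pair appears twice
df <- data.frame(
  V1 = rep(1:2, each = 4),
  V2 = rep(c("A0", "A0", "A1", "A1"), 2),
  V3 = c(10, 20, 30, 40, 50, 60, 70, 80),
  V4 = c("a", "b", "c", "d", "e", "f", "g", "h"),
  stringsAsFactors = FALSE
)

# Numeric column: collapse duplicates with the median
wide.v3 <- dcast(df, V1 ~ V2, value.var = "V3", fun.aggregate = median)
names(wide.v3)[-1] <- paste0("V3.", names(wide.v3)[-1])

# Character column: collapse duplicates by pasting them together
wide.v4 <- dcast(df, V1 ~ V2, value.var = "V4",
                 fun.aggregate = function(x) paste(x, collapse = "/"))
names(wide.v4)[-1] <- paste0("V4.", names(wide.v4)[-1])

# Combine the two wide tables on the id column
df.wide <- merge(wide.v3, wide.v4, by = "V1")
df.wide
```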
Answer 1: (score: 2)
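As noted above, reshape(...) generates the warning if the combination of idvar and timevar (V1 and V2 in your example) is not unique. One way to guarantee uniqueness is to aggregate on these two variables. This also makes explicit @Arun's point that if there are duplicates, you have to decide what to do with them. Here are a few options.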
set.seed(1)
# sample dataframe in same format as OP
type <- c("numeric","string")
df <- data.frame(V1=rep(1:10,each=8),V2=paste0("A",0:7),
                 V3=type[sample(1:2,80, replace=TRUE)],
                 V4=type[sample(1:2,80, replace=TRUE)])
dupes <- df[sample(1:80,10),] # some random duplicates
dupes[,3:4] <- type
df <- rbind(df,dupes) # append to original df
df.wide <- reshape(df, timevar = "V2", idvar = "V1", direction = "wide")
# many warnings...
func <- function(x) head(x,1) # if duplicate, use first value
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), func)
func <- function(x) tail(x,1) # if duplicate, use last value
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), func)
# if replicated, indicate number of replications
func <- function(x) {ifelse(length(x)==1,as.character(x), length(x))}
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), func)
# if duplicated, flag as such
func <- function(x) {ifelse(length(x)==1,as.character(x),"duplicated")}
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), func)
# if duplicates with different V3 or V4, indicate with "both"
func <- function(x) {ifelse(length(unique(x))==1,as.character(x),"both")}
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), func)
df.wide <- reshape(df.new, timevar = "V2", idvar = "V1", direction = "wide")
# no warnings - reshape succeeded.
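There are a few subtleties here worth explaining. Note the use of as.character(x) in the functions. This is because R treats df$V3 and df$V4 as factors. Using x alone would return the factor's underlying integer codes (1 or 2, since these factors have only 2 levels). Using as.character(...) forces R to return the factor labels ("string" or "numeric").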
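A small sketch of that difference follows. The factor f is built explicitly here, because since R 4.0.0 data.frame() no longer converts strings to factors unless stringsAsFactors = TRUE is set, so on a recent R the columns above would already be character and as.character() would simply be a harmless no-op.

```r
# Build a factor by hand to mimic the old data.frame() behaviour
f <- factor(c("string", "numeric", "string"))

levels(f)        # the labels, sorted alphabetically
as.integer(f)    # the underlying integer codes
as.character(f)  # the labels, one per element

# ifelse() drops the factor attributes, so the code comes back
code  <- ifelse(TRUE, f[1], "duplicated")
# converting first keeps the label
label <- ifelse(TRUE, as.character(f[1]), "duplicated")
code
label
```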
Finally, note that you can put the function definition directly in the call to aggregate(...), as in:
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2),
                    function(x) {ifelse(length(unique(x))==1,as.character(x),"both")})
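Since the question mentions repeating this for several data files, the check and the aggregation can be wrapped in one function. This is only a sketch under stated assumptions: to_wide is a hypothetical helper name, the columns are assumed to be called V1 to V4 as in the question, and "keep the first value" is just one of the options listed above, used as the default.

```r
# Hypothetical helper: collapse duplicate V1/V2 pairs, then reshape to wide
to_wide <- function(df, func = function(x) head(x, 1)) {
  key <- df[c("V1", "V2")]
  if (anyDuplicated(key) > 0) {
    # Report how many extra rows share a V1/V2 pair with an earlier row
    message(sum(duplicated(key)), " duplicate V1/V2 row(s) collapsed")
    df <- aggregate(df[c("V3", "V4")],
                    by = list(V1 = df$V1, V2 = df$V2), func)
  }
  reshape(df, timevar = "V2", idvar = "V1", direction = "wide")
}

# Made-up data with one repeated pair (V1 = 1, V2 = "A1")
df <- data.frame(
  V1 = c(1, 1, 1, 2, 2),
  V2 = c("A0", "A1", "A1", "A0", "A1"),
  V3 = c("numeric", "string", "numeric", "string", "string"),
  V4 = c("string", "string", "string", "numeric", "numeric"),
  stringsAsFactors = FALSE
)
df.wide <- to_wide(df)
df.wide
```

Passing a different func, such as one of the alternatives shown earlier, changes how duplicates are resolved without touching the rest of the code.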