Why does reshaping in R return a warning?

Date: 2013-12-27 05:26:11

Tags: r reshape

Here is the table that I need to convert to wide format:

V1      V2        V3        V4
1       A0      numeric   string
1       A1         .         .
1       A2         .         .
1       A3         .         .
1       A4         .         .
1       A5         .         .
1       A6         .         .
1       A7         .         .
2       A0         .         .
2       A1         .         .
...     ...        .         .

I have been trying something like this:

reshape(variable.name, timevar = "V2", idvar = "V1", direction = "wide")

This produces the following, which seems to be what I want:

V1   V3.A0     V4.A0    V3.A1     ...
1    Numeric   String   Numeric   ...
2    ...       ...      ...       ...

But I get a warning message:

Warning message:
In reshapeWide(data, idvar = idvar, timevar = timevar, varying = varying,  :
multiple rows match for V2 = blah: first taken

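Why does this warning occur, and how can I avoid it? I don't want to ignore it, because I have to do the same thing for several data files. Thanks! I really appreciate your help.

2 Answers:

Answer 0: (score: 3)

As several people have pointed out, you need to decide what you want to do with the extra values. dcast lets you specify an aggregation function. It does essentially the same thing as reshape with direction = "wide", but it also lets you say what to do when a cell has multiple values. Here is an example in which every combination is duplicated, and we display the full vector for each combination as a deparsed string (e.g. the values 86 and 93 show up as c("86", "93"), because melt has coerced them to character).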

library(reshape2)

# Make up data

df <- data.frame(
  V1=rep(1:3, 14), 
  V2=rep(paste0("A", 0:6), 6), 
  V3=sample(1:100, 42), 
  V4=paste0(sample(letters, 42, replace=TRUE), sample(letters, 42, replace=TRUE))  
)    
# Need to melt V3 and V4 together first because
# dcast does not allow multiple value variables;
# unfortunately, this also coerces V3 to character

df.melt <- melt(df, id.vars=c("V1", "V2")) 

# Function to handle multiple items for one V1 - V2
# pair.  In this case we just deparse the vectors,
# but if you wanted, you could convert the numerics
# back to integers, or do whatever you want (e.g.
# paste if character, median if numeric).

my_func <- function(x) {
  paste0(deparse(x), collapse="")
}
# Now convert to wide format with dcast    

dcast(
  df.melt, 
  V1 ~ V2 + variable,
  value.var="value",
  fun.aggregate=my_func
)

This produces the following result:

  V1         A0_V3         A0_V4          A1_V3         A1_V4
1  1 c("86", "93") c("yf", "pr")   c("5", "76") c("py", "aj")
2  2 c("53", "71") c("as", "mi")  c("42", "12") c("ho", "la")
3  3 c("69", "16") c("lm", "un") c("66", "100") c("xk", "px")
          A2_V3         A2_V4         A3_V3         A3_V4         A4_V3
1 c("43", "67") c("xh", "bk") c("79", "94") c("ix", "cx") c("51", "50")
2 c("14", "68") c("nq", "sr") c("25", "19") c("dw", "ay") c("28", "35")
3 c("21", "24") c("wu", "il") c("39", "88") c("vz", "yw") c("74", "65")
          A4_V4         A5_V3         A5_V4         A6_V3         A6_V4
1 c("hv", "uw") c("85", "34") c("cn", "ql") c("73", "87") c("px", "vy")
2 c("qb", "dc")  c("2", "72") c("ci", "du") c("81", "49") c("sd", "rx")
3 c("jk", "fv")  c("6", "90") c("sr", "yr") c("62", "97") c("rg", "dv")    

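The perfect solution would be a combination of reshape and dcast. Unfortunately, dcast (AFAIK) does not allow multiple value columns the way reshape does (hence the melt step and the coercion to character), and reshape does not allow aggregation functions (AFAIK).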

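You can work around this by running dcast twice, once with V3 and once with V4, and then merging the results, or by adding more intelligence to the aggregation function.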
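The "run dcast twice and merge" idea could look roughly like the sketch below. The choice of mean for V3 and a "/"-joined string for V4 is only an assumption for illustration, as are the names wide_v3, wide_v4 and wide; pick whatever aggregation suits your data.

```r
library(reshape2)

set.seed(42)
df <- data.frame(
  V1 = rep(1:3, 14),
  V2 = rep(paste0("A", 0:6), 6),
  V3 = sample(1:100, 42),
  V4 = paste0(sample(letters, 42, replace = TRUE),
              sample(letters, 42, replace = TRUE)),
  stringsAsFactors = FALSE
)

# One dcast per value column, each with its own aggregation function,
# so V3 stays numeric and V4 stays character.
wide_v3 <- dcast(df, V1 ~ V2, value.var = "V3", fun.aggregate = mean)
wide_v4 <- dcast(df, V1 ~ V2, value.var = "V4",
                 fun.aggregate = function(x) paste(x, collapse = "/"))

# Tag the columns so they do not clash after merging.
names(wide_v3)[-1] <- paste0(names(wide_v3)[-1], "_V3")
names(wide_v4)[-1] <- paste0(names(wide_v4)[-1], "_V4")

wide <- merge(wide_v3, wide_v4, by = "V1")
```

Because each value column is cast separately, no melt step is needed and the numeric column is never coerced to character.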

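Answer 1: (score: 2)

As mentioned, reshape(...) generates a warning if the combination of idvar and timevar (V1 and V2 in your example) is not unique. One way to guarantee uniqueness is to aggregate on these two variables. This also makes explicit @Arun's point that, if there are duplicates, you have to decide what to do with them. Here are a few options.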

set.seed(1)
# sample dataframe in same format as OP
type <- c("numeric","string")
df <- data.frame(V1=rep(1:10,each=8),V2=paste0("A",0:7),
                 V3=type[sample(1:2,80, replace=T)], 
                 V4=type[sample(1:2,80, replace=T)])
dupes <- df[sample(1:80,10),]   # some random duplicates
dupes[,3:4] <- type
df <- rbind(df,dupes)           # append to original df

df.wide <- reshape(df, timevar = "V2", idvar = "V1", direction = "wide")
# many warnings...

func   <- function(x) head(x,1)   # if duplicate, use first value
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), func)

func   <- function(x) tail(x,1)   # if duplicate, use last value
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), func)

# if replicated, indicate number of replications
func   <- function(x) {ifelse(length(x)==1,as.character(x), length(x))}
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), func)

# if duplicated, flag as such
func   <- function(x) {ifelse(length(x)==1,as.character(x),"duplicated")} 
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), func)

# if duplicates with different V3 or V4, indicate with "both"
func   <- function(x) {ifelse(length(unique(x))==1,as.character(x),"both")} 
df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), func)

df.wide <- reshape(df.new, timevar = "V2", idvar = "V1", direction = "wide")
# no warnings - reshape succeeded.
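Before picking one of these options, it may help to look at which rows actually collide. A minimal base R sketch, using a small made-up data frame (the names key and is_dup are illustrative, not from the question):

```r
# Toy data: the (V1, V2) pair (1, "A0") appears twice.
df <- data.frame(
  V1 = c(1, 1, 1, 2),
  V2 = c("A0", "A0", "A1", "A0"),
  V3 = c(10, 11, 12, 13),
  stringsAsFactors = FALSE
)

key <- df[c("V1", "V2")]

# Mark every row whose (V1, V2) pair occurs more than once,
# including the first occurrence.
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
df[is_dup, ]

# Quick yes/no check that can be run on each data file.
has_dups <- anyDuplicated(key) > 0
```

If has_dups is FALSE for a file, reshape(...) on that file should not raise the "multiple rows match" warning.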

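A few nuances in the aggregation functions above are worth explaining. Note the use of as.character(x) in them. This is needed because R treats df$V3 and df$V4 as factors (the default for string columns in data.frame() before R 4.0.0). Using x directly would return the factor's integer codes (1 or 2, since these factors have only 2 levels). Using as.character(...) forces R to return the factor labels ("string" or "numeric").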
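The difference between factor codes and labels inside ifelse(...) can be seen in a small sketch (the variable names f, codes and labels are made up for illustration):

```r
# A factor with two levels; levels are sorted alphabetically,
# so "numeric" has code 1 and "string" has code 2.
f <- factor(c("string", "numeric"))

# ifelse() drops the factor attributes, leaving the integer code.
codes <- ifelse(TRUE, f[1], "both")

# Converting first keeps the label.
labels <- ifelse(TRUE, as.character(f[1]), "both")
```

Here codes holds the integer code of "string", while labels holds the text itself.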

Finally, note that you can put the function definition directly in the call to aggregate(...), as in:

df.new <- aggregate(df[c("V3","V4")], by=list(V1=df$V1,V2=df$V2), 
                    function(x) {ifelse(length(unique(x))==1,as.character(x),"both")})
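Since the same steps have to be repeated for several data files, the aggregate-then-reshape pattern could be wrapped in a helper. This is only a sketch: reshape_wide_unique and its arguments are hypothetical names, and the default of keeping the first value per (id, time) pair is an assumption that mirrors what reshape does when it warns.

```r
# Hypothetical helper: collapse duplicate (id, time) pairs with `resolve`,
# then reshape to wide format.
reshape_wide_unique <- function(df, id = "V1", time = "V2",
                                resolve = function(x) x[1]) {
  values <- setdiff(names(df), c(id, time))
  agg <- aggregate(df[values], by = df[c(id, time)], FUN = resolve)
  reshape(agg, timevar = time, idvar = id, direction = "wide")
}

# Toy data with one duplicated pair: (1, "A0").
df <- data.frame(
  V1 = c(1, 1, 1, 2, 2),
  V2 = c("A0", "A0", "A1", "A0", "A1"),
  V3 = c(10, 11, 12, 13, 14),
  V4 = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

wide <- reshape_wide_unique(df)
```

Passing a different resolve function (for example one of the func definitions above) changes how duplicates are handled without touching the reshape step.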