I am working with the following dataset in R (ID = CUSTID):
ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1 NA On-line 1 New 5 0 1
1 NA On-line 1 Stream 5 0 1
3 EU Tele 2 Stream 5 1 0
I would like to convert the dataset into columns in this format:
ID Geo Brand Neworstream OnlineRevQ112 TeleRevQ112 OnlineRevQ212 TeleRevQ212
What is the best way to do this? I cannot figure out the best command in R.
Thanks in advance.
Answer 0 (score: 4)
You can use the reshape2 package and its melt and dcast functions to restructure the data.
data <- structure(list(ID = c(1L, 1L, 3L), Geo = structure(c(NA, NA,
1L), .Label = "EU", class = "factor"), Channel = structure(c(1L,
1L, 2L), .Label = c("On-line", "Tele"), class = "factor"), Brand = c(1L,
1L, 2L), Neworstream = structure(c(1L, 2L, 2L), .Label = c("New",
"Stream"), class = "factor"), RevQ112 = c(5L, 5L, 5L), RevQ212 = c(0L,
0L, 1L), RevQ312 = c(1L, 1L, 0L)), .Names = c("ID", "Geo", "Channel",
"Brand", "Neworstream", "RevQ112", "RevQ212", "RevQ312"), class = "data.frame", row.names = c(NA,
-3L))
library(reshape2)
## melt data
df_long<-melt(data,id.vars=c("ID","Geo","Channel","Brand","Neworstream"))
## recast in combinations of channel and time frame
dcast(df_long,... ~Channel+variable,sum)
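To make the two steps above concrete, here is a minimal base-R sketch of the same idea: stack the revenue columns into a long table, then spread them again by a combined channel and quarter key. It is only an illustration and not how reshape2 is implemented; the names toy, long, key and wide are hypothetical, and the toy data keeps just two of the revenue columns.

```r
# Toy data with one row per customer (hypothetical, reduced from the question)
toy <- data.frame(ID = c(1L, 3L),
                  Channel = c("On-line", "Tele"),
                  RevQ112 = c(5L, 5L),
                  RevQ212 = c(0L, 1L),
                  stringsAsFactors = FALSE)

# "Melt": one row per (ID, Channel, variable) with a single value column
measure_cols <- c("RevQ112", "RevQ212")
long <- do.call(rbind, lapply(measure_cols, function(col) {
  data.frame(toy[c("ID", "Channel")],
             variable = col,
             value = toy[[col]],
             stringsAsFactors = FALSE)
}))

# "Cast": sum the values for every combination of ID and Channel_variable
key <- paste(long$Channel, long$variable, sep = "_")
wide <- tapply(long$value, list(ID = long$ID, key = key), sum)
print(wide)
```

Combinations that never occur in the data (for example customer 1 on the Tele channel) come out as NA in this sketch.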
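Answer 1 (score: 2)
The "NA" in your dataset is probably not an NA value but the abbreviation "NA" for North America or something similar.
If you use na.strings when reading the data in, you should have no problem using reshape as I originally indicated: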
mydf <- read.table(header = TRUE, na.strings = "",
text = 'ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1 NA On-line 1 New 5 0 1
1 NA On-line 1 Stream 5 0 1
3 EU Tele 2 Stream 5 1 0')
reshape(mydf, direction = "wide",
idvar = c("ID", "Geo", "Brand", "Neworstream"),
timevar = "Channel")
(However, I would suggest changing your abbreviation for legibility and less confusion!)
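As a rough check of the point about na.strings, the sketch below reads the same text twice, once with the defaults and once with na.strings = "", and counts the missing values in Geo. The object names raw_text, default_read, literal_read and wide are made up for this illustration.

```r
raw_text <- "ID Geo Channel Brand Neworstream RevQ112 RevQ212 RevQ312
1 NA On-line 1 New 5 0 1
1 NA On-line 1 Stream 5 0 1
3 EU Tele 2 Stream 5 1 0"

# Default behaviour: the string "NA" is treated as a missing value
default_read <- read.table(header = TRUE, text = raw_text)

# With na.strings = "" the string "NA" is kept as an ordinary label
literal_read <- read.table(header = TRUE, text = raw_text, na.strings = "")

sum(is.na(default_read$Geo))  # missing values under the defaults
sum(is.na(literal_read$Geo))  # missing values when "NA" is kept as text

# All three ID combinations survive the reshape when Geo has no missing values
wide <- reshape(literal_read, direction = "wide",
                idvar = c("ID", "Geo", "Brand", "Neworstream"),
                timevar = "Channel")
nrow(wide)
```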
(There is still some interesting stuff about reshape below.) This should do it:
reshape(mydf, direction = "wide",
idvar = c("ID", "Geo", "Brand", "Neworstream"),
timevar = "Channel")
# ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line RevQ312.On-line
# 1 1 <NA> 1 New 5 0 1
# 3 3 EU 2 Stream NA NA NA
# RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1 NA NA NA
# 3 5 1 0
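As @Arun pointed out, the above is not entirely correct. The culprit here is interaction(), which reshape() uses to create a new temporary ID variable when more than one ID variable is specified.
Here is the line from reshape() and what it looks like when applied to our "mydf" object:
data[, tempidname] <- interaction(data[, idvar], drop = TRUE)
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] <NA> <NA> 3.EU.2.Stream
# Levels: 3.EU.2.Stream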
Hmm. That seems to simplify to two IDs, 3.EU.2.Stream and NA.
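The collapse can be isolated with two small factors. This is only a minimal sketch with made-up names (id_part, geo_part, combined): a single missing component is enough to make the whole combined key missing.

```r
id_part  <- factor(c("1", "1", "3"))
geo_part <- factor(c(NA, NA, "EU"))

# Any row with a missing component gets a missing combined key
combined <- interaction(id_part, geo_part, drop = TRUE)

as.character(combined)    # the combined keys as text
length(unique(combined))  # number of distinct keys, counting NA as one
```

Because the first two rows share the same missing key, anything that groups by this key cannot tell them apart.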
What happens if we replace NA with ""?
mydf$Geo <- as.character(mydf$Geo)
mydf$Geo[is.na(mydf$Geo)] <- ""
interaction(mydf[c(1, 2, 4, 5)], drop = TRUE)
# [1] 1..1.New 1..1.Stream 3.EU.2.Stream
# Levels: 1..1.New 1..1.Stream 3.EU.2.Stream
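If several ID columns might contain missing values, the same replacement could be wrapped in a small helper. The function fill_na_ids below is hypothetical (it is not part of base R or reshape2) and is only a sketch of the workaround shown above; it leaves columns without missing values untouched.

```r
# Hypothetical helper: replace NA with a placeholder in the given ID columns
fill_na_ids <- function(df, cols, fill = "") {
  for (col in cols) {
    if (anyNA(df[[col]])) {
      values <- as.character(df[[col]])  # factors and numbers become text
      values[is.na(values)] <- fill
      df[[col]] <- values
    }
  }
  df
}

toy <- data.frame(ID = c(1L, 1L, 3L),
                  Geo = c(NA, NA, "EU"),
                  stringsAsFactors = FALSE)
fixed <- fill_na_ids(toy, c("ID", "Geo"))
fixed$Geo
```

Note that a column that does contain NA is converted to character, which may or may not be what you want for numeric IDs.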
Aaahh. That is a little better. We now have three unique IDs, and reshape() handles them:
reshape(mydf, direction = "wide",
 idvar = names(mydf)[c(1, 2, 4, 5)],
 timevar = "Channel")
# ID Geo Brand Neworstream RevQ112.On-line RevQ212.On-line
# 1 1 1 New 5 0
# 2 1 1 Stream 5 0
# 3 3 EU 2 Stream NA NA
# RevQ312.On-line RevQ112.Tele RevQ212.Tele RevQ312.Tele
# 1 1 NA NA NA
# 2 1 NA NA NA
# 3 NA 5 1 0
That seems to work.
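The question asked for names such as OnlineRevQ112, grouped by quarter, whereas reshape() produces names such as RevQ112.On-line, grouped by channel. A possible way to derive the requested names and ordering is sketched below; wide_names, measure, channel, ord and new_names are illustrative names only, and the sketch works on the name vector rather than on the reshaped data frame itself.

```r
# Value column names as produced by reshape() in the output above
wide_names <- c("RevQ112.On-line", "RevQ212.On-line", "RevQ312.On-line",
                "RevQ112.Tele", "RevQ212.Tele", "RevQ312.Tele")

# Split each name into its measure and channel parts
parts   <- strsplit(wide_names, ".", fixed = TRUE)
measure <- vapply(parts, `[`, character(1), 1)
channel <- vapply(parts, `[`, character(1), 2)

# Order by quarter first, then by channel
ord <- order(measure, channel)

# Put the channel first and drop the hyphen from "On-line"
new_names <- paste0(gsub("-", "", channel, fixed = TRUE), measure)[ord]
new_names
```

The same ord index could be used to reorder the value columns of the reshaped data frame before assigning new_names to them.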