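I am trying to run some basic statistics (and more in-depth statistics later) on a data frame of sales with categorical variables. Besides sales, it tracks area (where the merchant is located), day of week, time of day (lunch, after work, etc.), and various other things.
Here is a small, random subset of the data (note that this is a pared-down representation: the actual data frame has 38 columns, and I have removed most of the ones that don't apply):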
structure(list(dayofweek = structure(c(4L, 7L, 3L, 7L, 3L, 2L,
2L, 7L, 3L, 3L, 2L, 7L, 5L, 5L, 2L, 5L, 1L, 3L, 7L, 3L, 4L, 1L,
3L, 5L, 7L), .Label = c("Friday", "Monday", "Saturday", "Sunday",
"Thursday", "Tuesday", "Wednesday"), class = "factor"), timeofday = structure(c(6L,
4L, 5L, 5L, 2L, 6L, 6L, 5L, 6L, 3L, 6L, 3L, 5L, 4L, 1L, 3L, 5L,
6L, 5L, 4L, 6L, 6L, 3L, 2L, 5L), .Label = c("After Work", "Early AM",
"Evening", "Late AM", "Lunch", "MidAfternoon", "Overnight"), class = "factor"),
area = c(6L, 4L, 4L, 5L, 5L, 1L, 4L, 2L, 3L, 2L, 7L, 3L,
7L, 5L, 7L, 4L, 1L, 4L, 1L, 4L, 5L, 7L, 1L, 3L, 7L), totsales = c(40,
6, 5, 10, 1, 0, 0, 3, 5, 3, 10, 30, 2, 1, 2, 22, 8, 1, 1,
5, 11, 20, 0, 1, 5)), .Names = c("dayofweek", "timeofday",
"area", "totsales"), class = "data.frame", row.names = c(192278L,
140773L, 121051L, 157984L, 154299L, 258034L, 108031L, 43760L,
78005L, 42103L, 95603L, 98431L, 30252L, 165303L, 40713L, 108252L,
304549L, 137041L, 268473L, 124599L, 161253L, 12897L, 240815L,
89439L, 21032L))
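The first thing I want to do is get the mean and median sales for each area and each time of day. I want R to go through each list and return all the values. I tried this: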
vallist<-list(a= c("Early AM", "Late AM", "Lunch", "MidAfternoon", "After Work",
"Evening", "Overnight"),
b= c(1,2,3,4,5,6,7))
sapply(vallist[['b']], function(x)
mapply(function(a,b) mean(data$totsales[which(data$timeofday==a & data$area==b)]),
vallist[['a']], vallist[['b']])
)
However, it only applies the mean to each time period in area 1, rather than to each time period in areas 1-7. So my results look like this:
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
Early AM 9.192847 9.192847 9.192847 9.192847 9.192847 9.192847 9.192847
Late AM 8.020678 8.020678 8.020678 8.020678 8.020678 8.020678 8.020678
Lunch 10.096277 10.096277 10.096277 10.096277 10.096277 10.096277 10.096277
MidAfternoon 11.503961 11.503961 11.503961 11.503961 11.503961 11.503961 11.503961
After Work 8.206124 8.206124 8.206124 8.206124 8.206124 8.206124 8.206124
Evening 11.457599 11.457599 11.457599 11.457599 11.457599 11.457599 11.457599
Overnight 11.415667 11.415667 11.415667 11.415667 11.415667 11.415667 11.415667
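This is the correct answer for area 1, but as you can see, the values are identical for every area. How can I get R to apply a function across multiple lists and return the values for all combinations?
The next steps will be to apply the median, and to evaluate at the area level and across the different days of the week, but I think the same idea will apply to all the different combinations.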
Answer 0 (score: 1)
For this particular case, you can reproduce the result with:
library(reshape2)
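# value.var is named explicitly so dcast does not have to guess the value column
dcast(data[-1], timeofday ~ area, value.var="totsales", fun.aggregate=mean, fill=0)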
Which produces:
timeofday 1 2 3 4 5 6 7
1 After Work 0.0 0 0 0.0 0 0 2.0
2 Early AM 0.0 0 1 0.0 1 0 0.0
3 Evening 0.0 3 30 22.0 0 0 0.0
4 Late AM 0.0 0 0 5.5 1 0 0.0
5 Lunch 4.5 3 0 5.0 10 0 3.5
6 MidAfternoon 0.0 0 5 0.5 11 40 15.0
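I'm fairly sure the difference between your results and the data you posted arises because the posted data is only a subset of the whole.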
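As a side note on why the original attempt repeats one column: mapply walks its arguments in parallel (first with first, second with second), and the outer sapply never uses its x, so the same paired result is returned seven times. Below is a minimal base R sketch of the difference, assuming a small made-up data frame called sales (hypothetical, not the asker's data). It compares the paired mapply result with a nested sapply over every combination, and with tapply, which builds the same wide table directly.

```r
# Toy data: hypothetical values, used only for illustration
sales <- data.frame(
  timeofday = c("Lunch", "Lunch", "Evening", "Evening", "Lunch"),
  area      = c(1, 1, 1, 2, 2),
  totsales  = c(4, 6, 10, 3, 8)
)

times <- c("Lunch", "Evening")
areas <- c(1, 2)

# Mean sales for one (time, area) cell
cell_mean <- function(a, b) {
  mean(sales$totsales[sales$timeofday == a & sales$area == b])
}

# mapply pairs elements: only ("Lunch", 1) and ("Evening", 2) are visited
paired <- mapply(cell_mean, times, areas)

# Nested sapply visits every (time, area) combination
grid <- sapply(areas, function(b) sapply(times, function(a) cell_mean(a, b)))
colnames(grid) <- areas

# tapply produces the same table without explicit loops;
# combinations with no rows come back as NA
wide_mean   <- tapply(sales$totsales, list(sales$timeofday, sales$area), mean)
wide_median <- tapply(sales$totsales, list(sales$timeofday, sales$area), median)

print(paired)
print(grid)
print(wide_mean)
```

With tapply, empty combinations show up as NA rather than 0, which keeps "no sales recorded" distinct from "mean sales of zero".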
Answer 1 (score: 0)
Converting my comment to an answer....
You seem to be interested in aggregate (though there are many ways to aggregate data in R).
out <- aggregate(totsales ~ timeofday + area, data, mean)
out
# timeofday area totsales
# 1 Evening 1 0.0
# 2 Lunch 1 4.5
# 3 MidAfternoon 1 0.0
# 4 Evening 2 3.0
# 5 Lunch 2 3.0
# 6 Early AM 3 1.0
# 7 Evening 3 30.0
# 8 MidAfternoon 3 5.0
# 9 Evening 4 22.0
# 10 Late AM 4 5.5
# 11 Lunch 4 5.0
# 12 MidAfternoon 4 0.5
# 13 Early AM 5 1.0
# 14 Late AM 5 1.0
# 15 Lunch 5 10.0
# 16 MidAfternoon 5 11.0
# 17 MidAfternoon 6 40.0
# 18 After Work 7 2.0
# 19 Lunch 7 3.5
# 20 MidAfternoon 7 15.0
If you want to go to wide format from there, you can use reshape (for example: reshape(out, direction = "wide", idvar="timeofday", timevar="area")).
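For the follow-up steps mentioned in the question (the median, and grouping by day of week), the same aggregate approach can probably be extended. The sketch below assumes a small made-up data frame called sales (hypothetical values, not the asker's data). It returns mean and median together, then repeats the idea at a different grouping level.

```r
# Toy data: hypothetical values, used only for illustration
sales <- data.frame(
  dayofweek = c("Monday", "Monday", "Tuesday", "Tuesday", "Monday"),
  timeofday = c("Lunch", "Lunch", "Evening", "Evening", "Lunch"),
  area      = c(1, 1, 1, 2, 2),
  totsales  = c(4, 6, 10, 3, 9)
)

# A function returning a named vector yields one matrix column
# holding both statistics for each (timeofday, area) group
stats <- aggregate(totsales ~ timeofday + area, data = sales,
                   FUN = function(x) c(mean = mean(x), median = median(x)))

# Flatten the matrix column into ordinary columns:
# totsales.mean and totsales.median
stats <- do.call(data.frame, stats)

# Same idea at another grouping level: median by day of week and area
by_day <- aggregate(totsales ~ dayofweek + area, data = sales, FUN = median)

print(stats)
print(by_day)
```

Changing the right-hand side of the formula (for example totsales ~ area alone) is enough to move between grouping levels, so one pattern covers the different combinations.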