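(Sorry for the odd title, but I couldn't think of a short way to put it.)
Since I managed to oversimplify my problem in my previous question, this time I'm giving you the actual problem.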
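The data frame provided contains the columns "usr", "usrMsgCnt", and "isRefound", where usr is a name, usrMsgCnt is a number, and isRefound is binary.
I want to add a new column whose value is computed as follows:
usrMsgCnt / (number of rows where usr equals this row's usr and isRefound equals 1)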
For the first row of the sample data, the new value would be:
9/5, where 5 is given by length(data$usr[data$usr == "Jan.Schrader" & data$isRefound == 1])
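To make the target calculation concrete, here is a minimal sketch on a hypothetical four-row data frame (`toy`, `n_refound`, and `new_value` are illustrative names, not part of the real data) that computes the value for a single row by direct subsetting:

```r
# Hypothetical toy data, not the real data set
toy <- data.frame(
  usr       = c("a", "a", "a", "b"),
  usrMsgCnt = c(6L, 6L, 6L, 2L),
  isRefound = c(1L, 0L, 1L, 1L)
)

# Number of rows belonging to the first row's user where isRefound is 1
n_refound <- length(toy$usr[toy$usr == toy$usr[1] & toy$isRefound == 1])

# Desired new value for the first row: usrMsgCnt divided by that count
new_value <- toy$usrMsgCnt[1] / n_refound
```

Doing this row by row is exactly the loop I want to avoid on the full data.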
Given the size of the original data set, using a loop is not an option.
Here is a small portion of the data as input:
structure(list(usr = structure(c(21L, 21L, 21L, 21L, 6L, 5L,
6L, 6L, 6L, 21L, 20L, 21L, 6L, 20L, 21L, 21L, 21L, 6L, 6L, 6L
), .Label = c("alsmith", "Amanda.Coles", "Andrew.Coles", "babsimieth",
"Bernd.Ludwig", "Bernhard.Schiemann", "bfueck", "Bram.Ridder",
"brian.tripney", "carlosgardeazabal", "christine.elsweiler",
"cmfinner", "daniel.goncalves", "david", "de56", "eko.ma", "freundlu",
"gmcphail", "ian.ferguson", "Ian.Ruthven", "Jan.Schrader", "jearmour",
"jyang", "Laura.Schnall", "Marc.Roper", "marek.maleika", "Martin.Hacker",
"martin.scholz", "maziminke", "mclanger", "Michael.Cashmore",
"morgan.harvey", "mrussell", "msherrif", "murray.wood", "Nadine.Mahrholz",
"noam.ascher", "pburns", "Peter.Gregory", "raina", "robertnm",
"ronald.teijeira", "ronaldtf", "sbenus", "starmstr", "steve.neely",
"Sven.Friedemann", "tinchen"), class = "factor"), usrMsgCnt = c(9L,
9L, 9L, 9L, 5L, 0L, 5L, 5L, 5L, 9L, 0L, 9L, 5L, 0L, 9L, 9L, 9L,
37L, 37L, 37L), isRefound = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L,
1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L)), .Names = c("usr",
"usrMsgCnt", "isRefound"), row.names = c(NA, 20L), class = "data.frame")
Answer 0 (score: 6)
Assuming isRefound really is binary:
library(data.table)
DT <- data.table(DF,key="usr")
DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr]
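Why `sum(isRefound)` can serve as the denominator: when the column holds only 0 and 1, summing it within a user counts that user's rows with isRefound equal to 1. A minimal base R sketch checking this on hypothetical toy data (`toy`, `cnt_by_sum`, and `cnt_by_length` are illustrative names, not from the question):

```r
# Hypothetical toy data, not the real data set
toy <- data.frame(
  usr       = c("a", "a", "a", "b", "b"),
  isRefound = c(1L, 0L, 1L, 0L, 0L)
)

# Per-user sum of the 0/1 column, repeated on every row of that user
cnt_by_sum <- ave(toy$isRefound, toy$usr, FUN = sum)

# Per-user count via explicit subsetting, as written in the question
cnt_by_length <- unname(sapply(toy$usr, function(u) {
  length(toy$usr[toy$usr == u & toy$isRefound == 1])
}))

# Both approaches should agree row by row
identical(as.integer(cnt_by_sum), as.integer(cnt_by_length))
```

A user with no refound rows gets a denominator of 0, which is where a NaN (for 0/0) comes from.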
Edit: If the row order is essential, you should not set a key (which sorts the data.table), and you should create an index variable (to be safe).
DT <- data.table(DF)
DT[,id:=.I]
DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr]
print(DT)
#                     usr usrMsgCnt isRefound id newvar
#  1:       Jan.Schrader         9         0  1    1.8
#  2:       Jan.Schrader         9         1  2    1.8
#  3:       Jan.Schrader         9         1  3    1.8
#  4:       Jan.Schrader         9         1  4    1.8
#  5: Bernhard.Schiemann         5         1  5    1.0
#  6:       Bernd.Ludwig         0         0  6    NaN
#  7: Bernhard.Schiemann         5         0  7    1.0
#  8: Bernhard.Schiemann         5         1  8    1.0
#  9: Bernhard.Schiemann         5         1  9    1.0
# 10:       Jan.Schrader         9         1 10    1.8
# 11:        Ian.Ruthven         0         0 11    NaN
# 12:       Jan.Schrader         9         0 12    1.8
# 13: Bernhard.Schiemann         5         1 13    1.0
# 14:        Ian.Ruthven         0         0 14    NaN
# 15:       Jan.Schrader         9         0 15    1.8
# 16:       Jan.Schrader         9         0 16    1.8
# 17:       Jan.Schrader         9         1 17    1.8
# 18: Bernhard.Schiemann        37         0 18    7.4
# 19: Bernhard.Schiemann        37         1 19    7.4
# 20: Bernhard.Schiemann        37         0 20    7.4
The same conceptual approach can be used with the base R and plyr methods demonstrated at your previous question:
within(DF, {
newvar <- usrMsgCnt/ave(isRefound, usr, FUN = sum)
})
library(plyr)
ddply(DF, .(usr), transform,
newvar = usrMsgCnt/sum(isRefound))
However, for large data sets, the data.table package will perform better.