Get the number of rows with the same value in one column and a positive binary value in another column

Time: 2013-03-21 19:27:28

Tags: r vector dataframe

(Sorry for the odd title, but I couldn't think of a concise way to put it.)

Since I managed to oversimplify my problem in my previous question, this time I am giving you the actual problem.

The data frame provided contains the columns "usr", "usrMsgCnt" and "isRefound", where usr is a name, usrMsgCnt is a number, and isRefound is binary.

I want to add a new column whose value is calculated as follows:

  usrMsgCnt / (number of rows where usr equals this row's usr and isRefound equals 1)

For the first row of the sample data, the new value would be:

  9/5, where 5 is given by length(data$usr[data$usr == "Jan.Schrader" & data$isRefound == 1])
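
As a quick check of that arithmetic, here is a minimal sketch that rebuilds the sample rows by hand and evaluates the expression for the first row. The name `data` follows the expression above. Storing usr as a character vector rather than a factor is a simplification, and `js`, `bs`, `n_refound` and `first_value` are names introduced only for this illustration.

```r
# Sample rows copied from the dput output; usr is kept as plain character
# (a simplification, the original column is a factor).
js <- "Jan.Schrader"
bs <- "Bernhard.Schiemann"
data <- data.frame(
  usr = c(js, js, js, js, bs, "Bernd.Ludwig", bs, bs, bs, js,
          "Ian.Ruthven", js, bs, "Ian.Ruthven", js, js, js, bs, bs, bs),
  usrMsgCnt = c(9, 9, 9, 9, 5, 0, 5, 5, 5, 9, 0, 9, 5, 0, 9, 9, 9, 37, 37, 37),
  isRefound = c(0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Denominator: rows belonging to the first row's user with isRefound == 1.
n_refound <- length(data$usr[data$usr == "Jan.Schrader" & data$isRefound == 1])

# New value for the first row.
first_value <- data$usrMsgCnt[1] / n_refound

print(n_refound)    # 5
print(first_value)  # 1.8
```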

Given the size of the original dataset, using a loop is not an option.

Here is the dput output for a small subset of the data:

structure(list(usr = structure(c(21L, 21L, 21L, 21L, 6L, 5L, 
6L, 6L, 6L, 21L, 20L, 21L, 6L, 20L, 21L, 21L, 21L, 6L, 6L, 6L
), .Label = c("alsmith", "Amanda.Coles", "Andrew.Coles", "babsimieth", 
"Bernd.Ludwig", "Bernhard.Schiemann", "bfueck", "Bram.Ridder", 
"brian.tripney", "carlosgardeazabal", "christine.elsweiler", 
"cmfinner", "daniel.goncalves", "david", "de56", "eko.ma", "freundlu", 
"gmcphail", "ian.ferguson", "Ian.Ruthven", "Jan.Schrader", "jearmour", 
"jyang", "Laura.Schnall", "Marc.Roper", "marek.maleika", "Martin.Hacker", 
"martin.scholz", "maziminke", "mclanger", "Michael.Cashmore", 
"morgan.harvey", "mrussell", "msherrif", "murray.wood", "Nadine.Mahrholz", 
"noam.ascher", "pburns", "Peter.Gregory", "raina", "robertnm", 
"ronald.teijeira", "ronaldtf", "sbenus", "starmstr", "steve.neely", 
"Sven.Friedemann", "tinchen"), class = "factor"), usrMsgCnt = c(9L, 
9L, 9L, 9L, 5L, 0L, 5L, 5L, 5L, 9L, 0L, 9L, 5L, 0L, 9L, 9L, 9L, 
37L, 37L, 37L), isRefound = c(0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 
1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L)), .Names = c("usr", 
"usrMsgCnt", "isRefound"), row.names = c(NA, 20L), class = "data.frame")

1 Answer:

Answer 0 (score: 6)

Assuming isRefound really is binary (so that its per-user sum equals the number of rows where it is 1):

library(data.table)
DT <- data.table(DF,key="usr")

DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr]
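
To see why the binary assumption matters, here is a small base R sketch on made-up data. The data frame `toy` and its values are hypothetical and not taken from the question. It compares dividing by the per-user sum of isRefound with dividing by the per-user count of rows where isRefound equals 1. The two agree only when isRefound contains nothing but 0 and 1.

```r
# Hypothetical data: user "a" has an isRefound value of 2, so it is not binary.
toy <- data.frame(
  usr       = c("a", "a", "a", "b"),
  usrMsgCnt = c(6, 6, 6, 4),
  isRefound = c(1, 2, 0, 1),
  stringsAsFactors = FALSE
)

# Per-user sum of isRefound, as in the answer above.
by_sum <- toy$usrMsgCnt / ave(toy$isRefound, toy$usr, FUN = sum)

# Per-user count of rows where isRefound is exactly 1.
by_count <- toy$usrMsgCnt /
  ave(as.integer(toy$isRefound == 1), toy$usr, FUN = sum)

print(by_sum)    # 2 2 2 4
print(by_count)  # 6 6 6 4
```

If the column might hold other values, replacing `sum(isRefound)` with `sum(isRefound == 1)` in the data.table call should give the count-based result.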

Edit: if the row order is essential, do not set a key (setting a key sorts the data.table), and create an index variable to be safe.

DT <- data.table(DF)
DT[,id:=.I]
DT[,newvar:=usrMsgCnt/sum(isRefound),by=usr]
print(DT)

#                    usr usrMsgCnt isRefound id newvar
#  1:       Jan.Schrader         9         0  1    1.8
#  2:       Jan.Schrader         9         1  2    1.8
#  3:       Jan.Schrader         9         1  3    1.8
#  4:       Jan.Schrader         9         1  4    1.8
#  5: Bernhard.Schiemann         5         1  5    1.0
#  6:       Bernd.Ludwig         0         0  6    NaN
#  7: Bernhard.Schiemann         5         0  7    1.0
#  8: Bernhard.Schiemann         5         1  8    1.0
#  9: Bernhard.Schiemann         5         1  9    1.0
# 10:       Jan.Schrader         9         1 10    1.8
# 11:        Ian.Ruthven         0         0 11    NaN
# 12:       Jan.Schrader         9         0 12    1.8
# 13: Bernhard.Schiemann         5         1 13    1.0
# 14:        Ian.Ruthven         0         0 14    NaN
# 15:       Jan.Schrader         9         0 15    1.8
# 16:       Jan.Schrader         9         0 16    1.8
# 17:       Jan.Schrader         9         1 17    1.8
# 18: Bernhard.Schiemann        37         0 18    7.4
# 19: Bernhard.Schiemann        37         1 19    7.4
# 20: Bernhard.Schiemann        37         0 20    7.4
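
The NaN entries above come from 0/0: those users have usrMsgCnt equal to 0 and no rows with isRefound equal to 1. A user with a positive usrMsgCnt and no such rows would give Inf instead, although that case does not occur in the sample. If missing values are preferred, one option is sketched below on a plain vector. The values in `newvar` here are hypothetical stand-ins, and `cleaned` is a name introduced for illustration.

```r
# Hypothetical results: two finite values, one 0/0 (NaN) and one 5/0 (Inf).
newvar <- c(1.8, 1.0, 0 / 0, 5 / 0)

# Replace everything that is not a finite number with NA.
cleaned <- newvar
cleaned[!is.finite(cleaned)] <- NA_real_

print(cleaned)
```

With a data.table, the same replacement could probably be written as `DT[!is.finite(newvar), newvar := NA_real_]`.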

The same conceptual approach also works with the base R and plyr methods demonstrated in the answers to your previous question:

within(DF, {
  newvar <- usrMsgCnt/ave(isRefound, usr, FUN = sum)
})

library(plyr)
ddply(DF, .(usr), transform,
      newvar = usrMsgCnt/sum(isRefound))

However, for large datasets the data.table package will perform better.