Conditional sum in a data.frame based on duplicates

Time: 2016-03-22 14:24:29

Tags: mysql r finance

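I have been trying to do a conditional sum on a data.frame that contains duplicates. I want to sum the rows that share the same permno and date, and create a separate column that holds this information and is otherwise filled with NA or, better, 0's.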

My dataset looks like this:

data.frame(crsp)

    permno     date    PAYDT DISTCD divamt FACPR FACSHR   PRC       RET
1   10022 19280929 19281001   1272   0.25     0      0 71.00  0.045208
2   10022 19280929 19281001   1232   1.00     0      0 71.00  0.045208
3   10022 19281031       NA     NA     NA    NA     NA 73.50  0.035211
4   10022 19281130       NA     NA     NA    NA     NA 72.50 -0.013605
5   10022 19281231 19290202   1232   1.00     0      0 68.00 -0.044828
6   10022 19281231 19290202   1272   0.25     0      0 68.00 -0.044828
7   10022 19290131       NA     NA     NA    NA     NA 73.75  0.084559
8   10022 19290228       NA     NA     NA    NA     NA 69.00 -0.064407
9   10022 19290328 19290401   1232   1.00     0      0 65.00 -0.039855
10  10022 19290328 19290401   1272   0.25     0      0 65.00 -0.039855
11  10022 19290430       NA     NA     NA    NA     NA 67.00  0.030769
12  10022 19290531       NA     NA     NA    NA     NA 64.75 -0.033582

First, I pasted permno and date together to make a unique key:

crsp$permnodate = paste(as.character(crsp$permno),as.character(crsp$date),sep="") 

Second, I tried to sum up the duplicates and put the result into a new data frame:

crsp_divsingl <- aggregate(crsp$divamt, by = list(permnodate = crsp$permnodate), FUN = sum, na.rm = TRUE)

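However, I cannot get this information back into the original data.frame (crsp) correctly, because cbind and cbind.fill do not let me match columns of different lengths, which is fair enough. Specifically, I want the sum of the divamts on one row (the first) of each unique permnodate, so that the new column matches the length of the rest of the data.frame. I have not had any luck with merge or match either.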

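I have not tried a loop yet, nor have I managed to write a working if or ifelse function. Basically, this could be done in Excel with a VLOOKUP or INDEX/MATCH formula, but it is trickier in R than I first thought.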
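For reference, the VLOOKUP-style lookup mentioned above is usually written with match() in R. The following is only a minimal sketch on three rows copied from the sample data. The column name divsum is an assumption made for illustration, and the sum is repeated on every row of a group rather than placed on the first row only.

```r
# Small stand-in for the data in the question (three rows copied from the sample).
crsp <- data.frame(
  permno = c(10022, 10022, 10022),
  date   = c(19280929, 19280929, 19281031),
  divamt = c(0.25, 1.00, NA)
)
crsp$permnodate <- paste(crsp$permno, crsp$date, sep = "")

# One row per key; aggregate() names the summed column "x" here.
crsp_divsingl <- aggregate(crsp$divamt,
                           by = list(permnodate = crsp$permnodate),
                           FUN = sum, na.rm = TRUE)

# match() returns, for every row of crsp, the position of its key in the
# lookup table, which is the role VLOOKUP plays in a spreadsheet.
idx <- match(crsp$permnodate, crsp_divsingl$permnodate)
crsp$divsum <- crsp_divsingl$x[idx]  # "divsum" is a hypothetical column name
```

Because idx has one entry per row of crsp, the new column always has the same length as the original data.frame, which avoids the cbind length problem.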

Any help is much appreciated.

Best regards,

Troels

1 Answer:

Answer 0: (score: 0)

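You can use duplicated and merge to achieve this more easily. I wrote an example. You will have to adapt it to your purposes, but hopefully it will set you on the right track: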

# Creating a fake sample dataset.
set.seed(9)
permno <- 10022:10071 # Allowing 50 possible permno's. 
date <- 19280929:19280978 # Allow 50 possible dates.
value <- c(NA, 1:9) # Allowing NA or a 1 through 9 value.

# Creating fake data frame.
crsp <- data.frame(permno = sample(permno, 1000, TRUE), date = sample(date, 1000, TRUE), value = sample(value, 1000, TRUE))

# Defining a helper that uses duplicated() in both directions, so it flags the repeated rows and the first occurrence.
fullDup <- function(x) {

  bool <- duplicated(x) | duplicated(x, fromLast = TRUE)
  return(bool)

}

# Getting the duplicated rows. fullDup returns a logical vector marking every row whose
# (permno, date) pair occurs more than once, including the first occurrence.
crsp.dup <- crsp[fullDup(crsp[, c("permno", "date")]), ]

# Now aggregate.
crsp.dup[is.na(crsp.dup)] <- 0 # Converting NA values to 0.
crsp.dup <- aggregate(value ~ permno + date, crsp.dup, sum)
names(crsp.dup)[3] <- "value.dup" # Changing the name of the value column.

# Now merge back in with the original dataset. Rows without a duplicate get NA in value.dup.
crsp <- merge(crsp, crsp.dup, by = c("permno", "date"), all.x = TRUE)
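If the goal is the layout described in the question (the total on the first row of each permno/date pair and 0 elsewhere), one possible alternative is to skip the merge and combine ave() with duplicated(). This is only a sketch on six rows copied from the question's sample, not the method used above; the column name divsum and the helper variables are assumptions made for illustration.

```r
# First six rows of the sample from the question (only the columns needed here).
crsp <- data.frame(
  permno = rep(10022, 6),
  date   = c(19280929, 19280929, 19281031, 19281130, 19281231, 19281231),
  divamt = c(0.25, 1.00, NA, NA, 1.00, 0.25)
)

# Treat missing dividends as 0, then sum within each (permno, date) group.
# ave() returns a vector as long as its input, so no merge is needed.
amt <- ifelse(is.na(crsp$divamt), 0, crsp$divamt)
group_sum <- ave(amt, crsp$permno, crsp$date, FUN = sum)

# Keep the total on the first row of each group and put 0 on the repeats.
is_repeat <- duplicated(crsp[, c("permno", "date")])
crsp$divsum <- ifelse(is_repeat, 0, group_sum)  # "divsum" is a hypothetical name
```

Unlike merge(), this keeps the original row order, and rows with no dividend get 0 instead of NA.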