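I need to run a bunch of linear models on each unique identifier, but first I need to do a check. For each unique ID and year, I need to check that there are at least 24 months of prior monthly data, but no more than 60. So when I run the regression, it should include between 24 and 60 observations of prior monthly data for each ID and year. If a year has fewer than 24 months of data, that year is dropped; if it has more than 60, only 60 months are used.
Thanks to this post (thanks @akrun), I was able to set up a linear model for each ID, run them, and output beta as the sum of the two betas. The problem is that this only regresses on the current year (12 obs) rather than on the previous 24-60 obs.
Edit: I realized I had entered this incorrectly... sorry.
dput for a single cusip: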
tdata <- structure(list(cusip = c(101L, 101L, 101L, 101L, 101L, 101L,
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L,
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L,
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L,
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L,
101L, 101L, 101L), date = c(19901130L, 19901031L, 19900928L,
19900831L, 19900731L, 19900629L, 19900531L, 19900430L, 19900330L,
19900228L, 19900131L, 19891229L, 19891130L, 19891031L, 19890929L,
19890831L, 19890731L, 19890630L, 19890531L, 19890428L, 19890331L,
19890228L, 19890131L, 19881230L, 19881130L, 19881031L, 19880930L,
19880831L, 19880729L, 19880630L, 19880531L, 19880429L, 19880331L,
19880229L, 19880129L, 19871231L, 19871130L, 19871030L, 19870930L,
19870831L, 19870731L, 19870630L, 19870529L, 19870430L, 19870331L,
19870227L, 19870130L, 19861231L, 19861128L, 19861031L, 19860930L,
19860829L, 19860731L), fyear = c("1990", "1990", "1990", "1990",
"1990", "1990", "1990", "1990", "1990", "1990", "1990", "1989",
"1989", "1989", "1989", "1989", "1989", "1989", "1989", "1989",
"1989", "1989", "1989", "1988", "1988", "1988", "1988", "1988",
"1988", "1988", "1988", "1988", "1988", "1988", "1988", "1987",
"1987", "1987", "1987", "1987", "1987", "1987", "1987", "1987",
"1987", "1987", "1987", "1986", "1986", "1986", "1986", "1986",
"1986"), month = c("11", "10", "09", "08", "07", "06", "05",
"04", "03", "02", "01", "12", "11", "10", "09", "08", "07", "06",
"05", "04", "03", "02", "01", "12", "11", "10", "09", "08", "07",
"06", "05", "04", "03", "02", "01", "12", "11", "10", "09", "08",
"07", "06", "05", "04", "03", "02", "01", "12", "11", "10", "09",
"08", "07"), ret = c("0.117647", "0.030303", "-0.161017", "-0.186207",
"-0.131737", "0.128378", "0.027778", "-0.162791", "0.131579",
"0.178295", "-0.091549", "0.163934", "-0.089552", "0.007519",
"0.117647", "0.155340", "0.211765", "0.024096", "0.338710", "0.377778",
"0.071429", "-0.176471", "0.378378", "-0.026316", "-0.050000",
"-0.047619", "-0.086957", "-0.061224", "0.088889", "-0.062500",
"-0.040000", "-0.056604", "0.081633", "0.042553", "-0.096154",
"0.238095", "-0.263158", "-0.393617", "-0.160714", "0.400000",
"-0.090909", "-0.200000", "-0.098361", "-0.152778", "0.000000",
"0.107692", "0.460674", "-0.101010", "-0.019802", "0.246914",
"-0.052632", "0.179310", "-0.064516"), ewretd = c(0.035468, -0.057155,
-0.080468, -0.108911, -0.025732, 0.005359, 0.045675, -0.028117,
0.021315, 0.015434, -0.046408, -0.012375, -0.0058, -0.049934,
0.005532, 0.018626, 0.031017, -0.007744, 0.025054, 0.029089,
0.01806, 0.002988, 0.062124, 0.018872, -0.036484, -0.011485,
0.016951, -0.025001, 0.000289, 0.047677, -0.017671, 0.014016,
0.03569, 0.060265, 0.077392, 0.026065, -0.05085, -0.272248, -0.015876,
0.014544, 0.035123, 0.021487, 0.000573, -0.017709, 0.036283,
0.074612, 0.117565, -0.034609, -0.006263, 0.023777, -0.059071,
0.023269, -0.073128), lagewretd = c(-0.004526, 0.035468, -0.057155,
-0.080468, -0.108911, -0.025732, 0.005359, 0.045675, -0.028117,
0.021315, 0.015434, -0.046408, -0.012375, -0.0058, -0.049934,
0.005532, 0.018626, 0.031017, -0.007744, 0.025054, 0.029089,
0.01806, 0.002988, 0.062124, 0.018872, -0.036484, -0.011485,
0.016951, -0.025001, 0.000289, 0.047677, -0.017671, 0.014016,
0.03569, 0.060265, 0.077392, 0.026065, -0.05085, -0.272248, -0.015876,
0.014544, 0.035123, 0.021487, 0.000573, -0.017709, 0.036283,
0.074612, 0.117565, -0.034609, -0.006263, 0.023777, -0.059071,
0.023269)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-53L), .Names = c("cusip", "date", "fyear", "month", "ret", "ewretd",
"lagewretd"))
dplyr code:
res1 <- tdata %>%
group_by(cusip, fyear) %>%
arrange(desc(date)) %>%
mutate(n=n()) %>%
do(data.frame(., beta=ifelse(.$n > 2,
sum(coef(lm(ret~ewretd+lagewretd, data=.))[-1]), NA)))
Update 2: April 13, 2015
Here is a for loop I can think of that would solve the problem, but for loops in R are not the most efficient solution.
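tdata$beta <- NA_real_
for (i in unique(tdata$cusip)) {
  for (j in unique(as.integer(tdata$fyear))) {
    # rows for this cusip from fiscal year j and the four years before it
    check <- tdata[tdata$cusip == i & as.integer(tdata$fyear) %in% (j - 4):j, ]
    if (nrow(check) < 24) next  # fewer than 24 months: beta stays NA
    # keep only the 60 most recent months
    check <- head(check[order(check$date, decreasing = TRUE), ], 60)
    # ret is stored as character in the dput above, so convert it for lm()
    fit <- lm(as.numeric(ret) ~ ewretd + lagewretd, data = check)
    tdata$beta[tdata$cusip == i & as.integer(tdata$fyear) == j] <- sum(coef(fit)[-1])
  }
}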
Update 3: full sample set
This includes all observations and is fairly large (323 MB).
Answer 0 (score: 1)
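In the long run, you will probably want to use proper dates. I have taken a small step in that direction by converting fyear from character to integer.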
library(dplyr)
## convert fyear to a proper number and then exploit for sorting
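tdata <- tdata %>%
  mutate(fyear = fyear %>% as.integer,
         ## ret is stored as character in the dput above; lm() needs a numeric response
         ret = as.numeric(ret)) %>%
  arrange(fyear, month)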
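Going one step further toward proper dates could look like the sketch below. It is only an illustration in base R: it parses a few of the yyyymmdd integers from the question's dput output, and the names date_int and date_parsed are made up here. The derived year is the calendar year, which may or may not match a fiscal year in the full data.

```r
# A few of the integer dates from the question's dput output.
date_int <- c(19901130L, 19901031L, 19900928L)

# Parse the yyyymmdd integers into Date objects.
date_parsed <- as.Date(as.character(date_int), format = "%Y%m%d")

# Calendar year and month can then be derived instead of stored as strings.
data.frame(date  = date_parsed,
           year  = as.integer(format(date_parsed, "%Y")),
           month = as.integer(format(date_parsed, "%m")))
```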
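Then I summarize the tbl at the level of fyear, computing the cumulative number of months you have available to fit a model. (I have carried cusip along, but since your data contains only one cusip, I cannot be sure this all works properly.)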
## figure out cumulative months available for each year (for each cusip)
yearstuff <- tdata %>%
group_by(cusip, fyear) %>%
summarize(n = n()) %>%
mutate(n_cum = cumsum(n))
yearstuff
# Source: local data frame [5 x 4]
# Groups: cusip
#
# cusip fyear n n_cum
# 1 101 1986 6 6
# 2 101 1987 12 18
# 3 101 1988 12 30
# 4 101 1989 12 42
# 5 101 1990 11 53
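The 24-to-60-month rule from the question can be read straight off n_cum. Here is a minimal sketch in base R, using a hand-typed copy of the counts printed above; the object yearstuff_demo and the columns enough and n_used are illustrative names, not part of the original code.

```r
# Hand-typed copy of the per-year counts shown above (single cusip).
yearstuff_demo <- data.frame(
  cusip = 101L,
  fyear = 1986:1990,
  n     = c(6L, 12L, 12L, 12L, 11L)
)

# Cumulative months available up to and including each fiscal year, per cusip.
yearstuff_demo$n_cum <- ave(yearstuff_demo$n, yearstuff_demo$cusip, FUN = cumsum)

# 'enough' and 'n_used' are illustrative names.
yearstuff_demo$enough <- yearstuff_demo$n_cum >= 24           # at least 24 months
yearstuff_demo$n_used <- ifelse(yearstuff_demo$enough,
                                pmin(yearstuff_demo$n_cum, 60L),  # cap at 60 months
                                NA_integer_)
yearstuff_demo
```

With these counts, only 1988 onward clears the 24-month minimum, and the 60-month cap never binds.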
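I could not find a natural way to fit the models with dplyr, since this task does not fit the group_by paradigm well. Instead, I start from yearstuff, use plyr::ddply(), and pull the data needed for each cusip * fyear combination. If there is not enough data, I decline to fit the model; if there is too much, I take only the most recent 60 months.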
## iterate over rows of yearstuff (for each cusip)
models <- plyr::ddply(yearstuff, ~ cusip + fyear, function(y) {
if(y$n_cum < 24) {
c('(Intercept)' = NA_real_, ewretd = NA_real_, lagewretd = NA_real_)
} else {
my_dat <- tdata %>%
filter(cusip == y$cusip, fyear <= y$fyear) %>%
mutate(rn = row_number(desc(date)))
lm(ret ~ ewretd + lagewretd, my_dat, subset = rn < 61) %>% coef
}
})
models
# cusip fyear (Intercept) ewretd lagewretd
# 1 101 1986 NA NA NA
# 2 101 1987 NA NA NA
# 3 101 1988 -0.01138861 1.614342 0.14885911
# 4 101 1989 0.02467139 1.878295 0.00598857
# 5 101 1990 0.02529068 1.900389 0.05766020
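The question wanted beta as the sum of the two slope coefficients, attached to the monthly rows. A sketch of that last step in base R follows, using a hand-typed copy of the coefficients printed above; models_demo, monthly_demo and merged are made-up names, and monthly_demo is a tiny stand-in for the real monthly data.

```r
# Hand-typed copy of the slope estimates shown above.
models_demo <- data.frame(
  cusip     = 101L,
  fyear     = 1986:1990,
  ewretd    = c(NA, NA, 1.614342, 1.878295, 1.900389),
  lagewretd = c(NA, NA, 0.14885911, 0.00598857, 0.05766020)
)

# beta is the sum of the two slopes; it stays NA where no model was fitted.
models_demo$beta <- models_demo$ewretd + models_demo$lagewretd

# Stand-in for the monthly rows: only the join keys matter here.
monthly_demo <- data.frame(cusip = 101L,
                           fyear = c(1990L, 1990L, 1989L, 1986L))

# Left join: every monthly row keeps its place and picks up the yearly beta.
merged <- merge(monthly_demo,
                models_demo[, c("cusip", "fyear", "beta")],
                by = c("cusip", "fyear"), all.x = TRUE)
merged
```

With dplyr loaded, left_join() with the same two keys should do the same job as merge(..., all.x = TRUE).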
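This leaves you with the estimated coefficients to use however you like. I think this should extend to multiple cusips, but who knows? This dataset also never has more than 60 months. Obviously, you should spot-check some of these results "by hand"!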
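One way to do such a spot check, and to exercise the two cases the sample data cannot (several cusips, more than 60 months), is to run the same rule on simulated data. The sketch below is base R only and rests on assumptions: window_beta() and make_cusip() are hypothetical helpers, the cusip ids 1 and 2 are invented, and the returns are random numbers, so the betas themselves mean nothing.

```r
# Hypothetical helper: fit the question's model on the most recent months
# (at most max_n) up to fiscal year yr; return NA if fewer than min_n exist.
window_beta <- function(d, yr, min_n = 24, max_n = 60) {
  d <- d[d$fyear <= yr, ]
  if (nrow(d) < min_n) return(NA_real_)
  d <- head(d[order(d$date, decreasing = TRUE), ], max_n)
  sum(coef(lm(ret ~ ewretd + lagewretd, data = d))[-1])
}

# Simulated data: 84 months (1984-1990) per cusip, so later years
# have more than 60 months of history.
set.seed(1)
months <- expand.grid(month = 1:12, fyear = 1984:1990)
make_cusip <- function(id) {
  ewretd    <- rnorm(nrow(months), sd = 0.05)
  lagewretd <- c(0, head(ewretd, -1))
  data.frame(cusip = id, fyear = months$fyear,
             date  = months$fyear * 10000L + months$month * 100L + 28L,
             ewretd = ewretd, lagewretd = lagewretd,
             ret = 1.5 * ewretd + 0.2 * lagewretd +
                   rnorm(nrow(months), sd = 0.02))
}
sim <- rbind(make_cusip(1L), make_cusip(2L))

# One beta per cusip and fiscal year.
grid <- unique(sim[, c("cusip", "fyear")])
grid$beta <- mapply(function(id, yr) window_beta(sim[sim$cusip == id, ], yr),
                    grid$cusip, grid$fyear)

# Spot check "by hand": cusip 2 in 1990 should use only 1986-1990 (60 months).
by_hand <- sim[sim$cusip == 2L & sim$fyear >= 1986, ]
all.equal(grid$beta[grid$cusip == 2L & grid$fyear == 1990],
          sum(coef(lm(ret ~ ewretd + lagewretd, data = by_hand))[-1]))
```

If the windowing is right, the value from the loop and the hand-built fit agree, and only the first year of each cusip, which has just 12 months, comes back as NA.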