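Is it possible to create a binary variable based on the first occurrence of another (date) variable?
For my thesis, I am trying to create a variable that captures the number of first forecasts issued and revised in a month, divided by the number of forecasts for a given firm at the end of the month. For convenience, I would like to list first forecasts issued and revised forecasts separately.
Sample data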
dt <- data.table(
analyst = rep((1:2),10),
id = rep((1:5),4),
year = rep(as.Date(c('2009-12-31','2009-12-31','2010-12-31','2010-12-31'),format='%Y-%m-%d'),5),
fdate = rep(as.Date(c('2009-07-31','2009-02-26','2010-01-31','2010-05-15','2009-06-30','2009-10-08','2010-07-31','2010-11-30','2009-01-31','2009-06-26','2010-05-03','2010-04-13','2009-10-30','2009-11-02','2010-03-28','2010-10-14','2009-02-17','2009-09-14','2010-08-02','2010-10-03'),format='%Y-%m-%d')))
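To create the variable, I used the following steps. First, I use the following code to identify the first forecast issued in a given year (for an analyst's firm):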
dt2 <- setkey(setDT(dt), id, year, analyst)[order(fdate),.SD[1L] ,by=list(id,year)]
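However, this produces a table that contains only the first forecast for each id and year, along with its analyst. Second, I flag these first forecasts by setting first to 1: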
dt3 <- print(dt2[, first:=1L])
Third, I merge the two data.tables:
dt4 <- dt3[dt, on = c('id', 'year', 'analyst', 'fdate')]
Fourth, I replace NA with 0:
dt4[is.na(dt4)] <- 0
Fifth, I create the binary variable for revised forecasts:
dt4$rev <- ifelse(dt4$first == 0,"1", "0")
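Finally, I sum the number of first and revised forecasts per firm per month.
Is there a more elegant way to create this variable, so that I can learn more about R / data.table? Based on the following answers, I tried to incorporate dcast: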
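The final summation is described but not shown. A minimal sketch of one way it could look, assuming first and rev are stored as numeric 0/1 (the ifelse call above returns the character strings "1" and "0", which sum() cannot add) and that "month" means the calendar month of fdate. The rows are copied from the result table; the names monthly, n_first and n_rev are hypothetical.

```r
library(data.table)

# A few rows taken from the result table, with first/rev as integers
dt4 <- data.table(
  id    = c(2L, 2L, 5L, 5L, 5L),
  fdate = as.Date(c("2009-02-17", "2009-02-26", "2009-06-30",
                    "2009-06-26", "2010-03-28")),
  first = c(1L, 0L, 0L, 1L, 1L),
  rev   = c(0L, 1L, 1L, 0L, 0L)
)

# Count first forecasts and revisions per firm (id) and calendar month
monthly <- dt4[, .(n_first = sum(first), n_rev = sum(rev)),
               by = .(id, month = format(fdate, "%Y-%m"))]
monthly
```

The denominator (forecasts at the end of the month) is left out here because its exact definition is not given.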
R data.table - categorical values in one column to binary values in multiple columns
How to programmatically create binary columns based on a categorical variable in data.table?
However, that did not work for me.
Current result, based on the steps described above:
id year analyst fdate first rev
1 2009-12-31 1 2009-07-31 1 0
1 2009-12-31 2 2009-10-08 0 1
1 2010-12-31 1 2010-05-03 1 0
1 2010-12-31 2 2010-10-14 0 1
2 2009-12-31 1 2009-02-17 1 0
2 2009-12-31 2 2009-02-26 0 1
2 2010-12-31 1 2010-07-31 0 1
2 2010-12-31 2 2010-04-13 1 0
3 2009-12-31 1 2009-10-30 0 1
3 2009-12-31 2 2009-09-14 1 0
3 2010-12-31 1 2010-01-31 1 0
3 2010-12-31 2 2010-11-30 0 1
4 2009-12-31 1 2009-01-31 1 0
4 2009-12-31 2 2009-11-02 0 1
4 2010-12-31 1 2010-08-02 0 1
4 2010-12-31 2 2010-05-15 1 0
5 2009-12-31 1 2009-06-30 0 1
5 2009-12-31 2 2009-06-26 1 0
5 2010-12-31 1 2010-03-28 1 0
5 2010-12-31 2 2010-10-03 0 1
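Answer 0 (score: 2)
We can replace the ifelse with a base R approach as well. Create 'first' as 0, then join with 'dt2' on the key columns and set 'first' to 1 for the matching rows. Negate it (!), convert the result to integer with + or as.integer, and assign it to rev: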
dt[, first := 0][dt2, first := 1, on = .(id, year, analyst, fdate)]
dt[, rev := +(!first)][]
# analyst id year fdate first rev
# 1: 1 1 2009-12-31 2009-07-31 1 0
# 2: 2 1 2009-12-31 2009-10-08 0 1
# 3: 1 1 2010-12-31 2010-05-03 1 0
# 4: 2 1 2010-12-31 2010-10-14 0 1
# 5: 1 2 2009-12-31 2009-02-17 1 0
# 6: 2 2 2009-12-31 2009-02-26 0 1
# 7: 1 2 2010-12-31 2010-07-31 0 1
# 8: 2 2 2010-12-31 2010-04-13 1 0
# 9: 1 3 2009-12-31 2009-10-30 0 1
#10: 2 3 2009-12-31 2009-09-14 1 0
#11: 1 3 2010-12-31 2010-01-31 1 0
#12: 2 3 2010-12-31 2010-11-30 0 1
#13: 1 4 2009-12-31 2009-01-31 1 0
#14: 2 4 2009-12-31 2009-11-02 0 1
#15: 1 4 2010-12-31 2010-08-02 0 1
#16: 2 4 2010-12-31 2010-05-15 1 0
#17: 1 5 2009-12-31 2009-06-30 0 1
#18: 2 5 2009-12-31 2009-06-26 1 0
#19: 1 5 2010-12-31 2010-03-28 1 0
#20: 2 5 2010-12-31 2010-10-03 0 1
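As a possible shortcut, the flag could also be computed in one grouped step, without the intermediate dt2. The sketch below is an assumption and not part of the answer above: it marks the earliest fdate within each id and year, so unlike .SD[1L] it would flag every row that ties for the earliest date. The sample data contains no such ties.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  analyst = rep(1:2, 10),
  id      = rep(1:5, 4),
  year    = rep(as.Date(c("2009-12-31", "2009-12-31",
                          "2010-12-31", "2010-12-31")), 5),
  fdate   = as.Date(c(
    "2009-07-31", "2009-02-26", "2010-01-31", "2010-05-15", "2009-06-30",
    "2009-10-08", "2010-07-31", "2010-11-30", "2009-01-31", "2009-06-26",
    "2010-05-03", "2010-04-13", "2009-10-30", "2009-11-02", "2010-03-28",
    "2010-10-14", "2009-02-17", "2009-09-14", "2010-08-02", "2010-10-03"))
)

# 1 where the forecast date is the earliest within its id/year group, else 0
dt[, first := +(fdate == min(fdate)), by = .(id, year)]

# rev is the complement of first
dt[, rev := +(!first)]

# Sort to match the layout of the printed result
setorder(dt, id, year, analyst)
dt
```

Both steps use := and therefore update dt by reference, so no merge or NA replacement is needed.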