I have a dataset on which I want to do the following, but I can't find the best solution.
Name  Date  Paid  Outstanding  Mark as Follows       Close Indicator
A     2000  100   200          Open                   0
A     2001  224   100          Open                   0
A     2002  348   400          Open                   0
A     2003  472   0            First Time it Closes   1
A     2004  596   196          Reopens               -1
B     2004  720   200          Open                   0
B     2005  844   200          Open                   0
B     2006  968   0            First Time it Closes   1
B     2007  968   0            Closes                 0
C     2000  1092  200          Open                   0
C     2001  1216  1200         Open                   0
B     2008  1340  1200         Reopens               -1
B     2010  1464  100          Open                   0
B     2011  1588  0            Closes                 1
A     2016  1712  0            Closes                 1
D     2009  1836  60           Open                   0
D     2010  1896  0            Closes                 1
D     2016  1900  0            Closes                 0
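I want to be able to reproduce the Close Indicator column. The amounts are cumulative transaction totals. My logic works per Name: if a payment has been made and nothing is Outstanding, I want to mark the row with 1, meaning the case closed. If the case is reopened later, I want to mark that row with -1, and then mark 1 again when it closes. So A closes in 2003, reopens in 2004, and closes again in 2016.
For D, the case closes in 2010, but the payment changes in 2016. In theory that would also earn a reopen flag, but since the case closes again at the same moment, I want to be able to handle this situation (the expected indicator there is 0).
What is the best way to do this with an R data.table?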
Answer 0 (score: 0)
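The logic, for each name, can be expressed as (Outstanding == 0) - (lag(Outstanding) == 0).
This takes the difference between two logicals, each coerced to 0 or 1: the result is 1 when the balance has just reached zero, -1 when it has just left zero, and 0 when nothing changed.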
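As a minimal sketch of that coercion, here is the formula applied to a single name in base R. The vector `outstanding` is made-up illustration data, and `prev_closed` is a hand-rolled lag introduced only for this example.

```r
# Hypothetical balances for one name, already in date order
outstanding <- c(200, 0, 0, 196)

closed <- outstanding == 0               # FALSE TRUE TRUE FALSE
prev_closed <- c(NA, head(closed, -1))   # hand-rolled lag; the first row has no predecessor

# Subtracting two logical vectors coerces them to integers
indicator <- closed - prev_closed
print(indicator)
```

The first element is NA because there is no previous row to compare against, which is why the NA is replaced with 0 afterwards.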
So all we have to do is group by name, sort by date, and apply that formula.
library('tidyverse')
df <- tribble(
~Name, ~Date, ~Outstanding,
"A", 2000L, 200L,
"A", 2001L, 100L,
"A", 2002L, 400L,
"A", 2003L, 0L,
"A", 2004L, 196L,
"B", 2004L, 200L,
"B", 2005L, 200L,
"B", 2006L, 0L,
"B", 2007L, 0L,
"C", 2000L, 200L,
"C", 2001L, 1200L,
"B", 2008L, 1200L,
"B", 2010L, 100L,
"B", 2011L, 0L,
"A", 2016L, 0L,
"D", 2009L, 60L,
"D", 2010L, 0L,
"D", 2016L, 0L
)
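df %>%
  rowid_to_column() %>%                      # remember the original row order
  group_by(Name) %>%
  arrange(Date) %>%                          # lag() needs each name's rows in date order
  mutate(close_indicator = (Outstanding == 0) - (lag(Outstanding) == 0)) %>%
  replace_na(list(close_indicator = 0)) %>%  # first row per name has no lag, so it is NA
  arrange(rowid)                             # restore the original row order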
# # A tibble: 18 x 5
# # Groups: Name [4]
# rowid Name Date Outstanding close_indicator
# <int> <chr> <int> <int> <dbl>
# 1 1 A 2000 200 0
# 2 2 A 2001 100 0
# 3 3 A 2002 400 0
# 4 4 A 2003 0 1
# 5 5 A 2004 196 -1
# 6 6 B 2004 200 0
# 7 7 B 2005 200 0
# 8 8 B 2006 0 1
# 9 9 B 2007 0 0
# 10 10 C 2000 200 0
# 11 11 C 2001 1200 0
# 12 12 B 2008 1200 -1
# 13 13 B 2010 100 0
# 14 14 B 2011 0 1
# 15 15 A 2016 0 1
# 16 16 D 2009 60 0
# 17 17 D 2010 0 1
# 18 18 D 2016 0 0
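For data.table, you can use
library(data.table)
dt <- as.data.table(df)  # df as defined above; rows must already be in Date order within each Name
dt[, close_indicator := (Outstanding == 0) - (shift(Outstanding) == 0), by = Name]
dt[is.na(close_indicator), close_indicator := 0]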
# Name Date Outstanding close_indicator
# 1: A 2000 200 0
# 2: A 2001 100 0
# 3: A 2002 400 0
# 4: A 2003 0 1
# 5: A 2004 196 -1
# 6: B 2004 200 0
# 7: B 2005 200 0
# 8: B 2006 0 1
# 9: B 2007 0 0
# 10: C 2000 200 0
# 11: C 2001 1200 0
# 12: B 2008 1200 -1
# 13: B 2010 100 0
# 14: B 2011 0 1
# 15: A 2016 0 1
# 16: D 2009 60 0
# 17: D 2010 0 1
# 18: D 2016 0 0
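The data.table version relies on each name's rows already being in date order. If that is not guaranteed, one possible sketch is to sort first and fill the lag with the group's own first value, so that no NA clean-up step is needed. The sample rows below are made up, and `closed` is a helper column introduced for illustration.

```r
library(data.table)

# Hypothetical rows, deliberately out of date order
dt <- data.table(
  Name        = c("D", "D", "D", "A", "A"),
  Date        = c(2016L, 2009L, 2010L, 2004L, 2003L),
  Outstanding = c(0L, 60L, 0L, 196L, 0L)
)

setorder(dt, Name, Date)            # sort by reference so shift() sees date order
dt[, closed := Outstanding == 0L]   # helper column: is the balance zero?

# Filling the lag with the group's first value makes the first row's difference 0
dt[, close_indicator := closed - shift(closed, fill = closed[1L]), by = Name]
print(dt)
```

Note that `setorder` changes the row order of the table. In this sketch D's 2016 row gets 0 because the balance stays at zero, which matches the behaviour asked for in the question.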
Answer 1 (score: 0)
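Using data.table. I'm not sure I fully understand your criteria, but I think the example below should be enough to get you started. The main function that should help here is the shift function in data.table. Combined with a grouped operation (using the by = .(Name) clause), it lets you add a column holding the previous balance.
Once that column exists, you can use compound logic based on your conditions to set the desired flag on the relevant rows.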
library(data.table)
DT <- data.table(Name = c("A", "A", "A", "A", "A","A", "A", "A", "B", "B","B", "B", "B"),
Date = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2000, 2001, 2002, 2003),
Outstanding = c(200, 100, 600 ,400, 0, 196, 200, 0, 500, 600, 0, 200, 0))
setkey(DT,Name,Date)
## Add a new column for previous outstanding balance
DT[,Prev_Outstanding := shift(Outstanding, n = 1L, fill = NA, type = "lag"), by = .(Name)]
DT[,CloseIndicator := 0] ## Pre-fill all rows with 0 initially
DT[Prev_Outstanding > 0 & Outstanding == 0, CloseIndicator := 1, by = .(Name)] ## Mark account closings
DT[Prev_Outstanding == 0 & Outstanding > 0, CloseIndicator := -1, by = .(Name)] ## Mark Account re-openings
print(DT)
Yields:
Name Date Outstanding Prev_Outstanding CloseIndicator
1: A 2000 200 NA 0
2: A 2001 100 200 0
3: A 2002 600 100 0
4: A 2003 400 600 0
5: A 2004 0 400 1
6: A 2005 196 0 -1
7: A 2006 200 196 0
8: A 2007 0 200 1
9: B 2000 600 NA 0
10: B 2001 0 600 1
11: B 2002 200 0 -1
12: B 2003 0 200 1
13: B 2008 500 0 -1
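The two conditional assignments can also be collapsed into a single expression. This is a rough sketch, assuming data.table 1.13.0 or later (where `fcase` is available) and using a small made-up table `DT2` rather than the data above.

```r
library(data.table)

# Hypothetical balances for a single name, in date order
DT2 <- data.table(Name = "A", Date = 2000:2003, Outstanding = c(200, 0, 196, 0))

# Previous balance within each name; NA on the first row
DT2[, Prev_Outstanding := shift(Outstanding), by = .(Name)]

# Rows where no condition is TRUE (including NA conditions) get the default
DT2[, CloseIndicator := fcase(
  Prev_Outstanding > 0 & Outstanding == 0, 1,    # account closes
  Prev_Outstanding == 0 & Outstanding > 0, -1,   # account reopens
  default = 0
)]
print(DT2)
```

Whether this reads better than separate assignments is a matter of taste. It keeps the rules in one place, at the cost of requiring a newer data.table.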