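I want to build a cohort of new users of a drug (Ray 2003). My original dataset has about 19 million rows, so loops have proven inefficient. Here is a dummy dataset (built with fruit instead of drugs):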
df2
names dates age sex fruit
1 tom 2010-02-01 60 m apple
2 mary 2010-05-01 55 f orange
3 tom 2010-03-01 60 m banana
4 john 2010-07-01 57 m kiwi
5 mary 2010-07-01 55 f apple
6 tom 2010-06-01 60 m apple
7 john 2010-09-01 57 m apple
8 mary 2010-07-01 55 f orange
9 john 2010-11-01 57 m banana
10 mary 2010-09-01 55 f apple
11 tom 2010-08-01 60 m kiwi
12 mary 2010-11-01 55 f apple
13 john 2010-12-01 57 m orange
14 john 2011-01-01 57 m apple
I have identified the people who were prescribed apples between 04-2010 and 10-2010:
temp2
names dates age sex fruit
6 tom 2010-06-01 60 m apple
5 mary 2010-07-01 55 f apple
7 john 2010-09-01 57 m apple
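I want to create a new column in the original data frame called "index", holding the first date on which a person was prescribed the drug within the defined date range. This is what I have tried in order to get the dates from temp2 into df2$index: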
df2$index<-temp2$dates
df2$index<-df2$dates == temp2$dates
df2$index<-df2$dates %in% temp2$dates
df2$index<-ifelse(as.Date(df$dates)==as.Date(temp2$dates), as.Date(temp2$dates),NA)
I am clearly not doing this right, because none of these work. Here is the desired output:
df2
names dates age sex fruit index
1 tom 2010-02-01 60 m apple <NA>
2 mary 2010-05-01 55 f orange <NA>
3 tom 2010-03-01 60 m banana <NA>
4 john 2010-07-01 57 m kiwi <NA>
5 mary 2010-07-01 55 f apple 2010-07-01
6 tom 2010-06-01 60 m apple 2010-06-01
7 john 2010-09-01 57 m apple 2010-09-01
8 mary 2010-07-01 55 f orange <NA>
9 john 2010-11-01 57 m banana <NA>
10 mary 2010-09-01 55 f apple <NA>
11 tom 2010-08-01 60 m kiwi <NA>
12 mary 2010-11-01 55 f apple <NA>
13 john 2010-12-01 57 m orange <NA>
14 john 2011-01-01 57 m apple <NA>
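Once I have the desired output, I want to look back from the index date and see whether each person had an apple in the preceding 180 days. If they did not have an apple, I want to keep them. If they did have an apple (e.g. tom), I want to discard them. This is the code I tried on the desired output: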
df4<-df2[df2$fruit!='apple' & df2$index-180,]
df4<-df2[df2$fruit!='apple' & df2$dates<=df2$index-180,] ##neither work for me
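I would appreciate any guidance on these problems, even pointers to what I should read to learn how to do this. Perhaps my logic is flawed and my approach cannot work; please tell me if that is the case! Thanks in advance.
Here is my df: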
names<-c("tom", "mary", "tom", "john", "mary",
"tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01",
"2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01",
"2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01",
"2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi",
"apple", "apple", "apple", "orange", "banana", "apple",
"kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m",
"f","m","f","m","f","m", "m"))
df2<-data.frame(names,dates, age, sex, fruit)
df2
Here is temp2:
data1<-df2[df2$fruit=="apple"& (df2$dates >= "2010-04-01" & df2$dates< "2010-10-01"), ]
index <- with(data1, order(dates))
temp<-data1[index, ]
dup<-duplicated(temp$names)
temp1<-cbind(temp,dup)
temp2<-temp1[temp1$dup!=TRUE,]
temp2$dup<-NULL
Solution
df2 <- df2[with(df2, order(names, dates)), ]
df2$first.date <- ave(df2$dates, df2$names, df2$fruit,
FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1]) ##DWin code for assigning index date for each fruit in the pre-period
df2$x<-df2$fruit=='apple' & df2$dates>df2$first.date-180 & df2$dates<df2$first.date ##TRUE marks an apple in the 180 days before the index date, i.e. the person (tom) is not a new user
ids <- with(df2, unique(names[which(x)])) ##ids with at least one TRUE row; which() also drops the NA rows
new_users<-subset(df2, !names %in% ids) ##gets rid of id that has at least one value of true
Answer 0 (score: 4)
Sort by name and date:
df <- df[with(df, order(names, dates)), ]
Then pick the first date within each name:
df$first.date <- ave(df$dates, df$names, FUN=function(d) d[1])
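As a minimal illustration of the per-group "first element" idea, the sketch below uses toy vectors (grp and d are hypothetical, not from the question). It shows that ave() returns a vector as long as its input, with each group's single result recycled across that group's rows.

```r
# Toy data (hypothetical): already sorted by group, then by date.
grp <- c("a", "a", "b", "b", "b")
d   <- as.Date(c("2010-01-05", "2010-02-01", "2010-03-10",
                 "2010-04-01", "2010-05-01"))

# ave() splits d by grp, applies FUN to each piece, and writes the
# result back into the original positions, recycling it within the group.
first_d <- ave(d, grp, FUN = function(x) x[1])

print(data.frame(grp, d, first_d))
```

Because the result lines up row by row with the input, it can be assigned straight into a new column, which is why the data must be sorted by date within each group first.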
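Now that you have witnessed the power of this fully operational Death Star, er, ave function, you are ready to pick out the first date within each combination of 'names' and 'fruit' that falls inside the date range: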
> df$first.date <- ave(df$date, df$name, df$fruit,
FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1] )
> df
names dates age sex fruit first.date
4 john 2010-07-01 57 m kiwi 2010-07-01
7 john 2010-09-01 57 m apple 2010-09-01
9 john 2010-11-01 57 m banana <NA>
13 john 2010-12-01 57 m orange <NA>
14 john 2011-01-01 57 m apple 2010-09-01
2 mary 2010-05-01 55 f orange 2010-05-01
5 mary 2010-07-01 55 f apple 2010-07-01
8 mary 2010-07-01 55 f orange 2010-05-01
10 mary 2010-09-01 55 f apple 2010-07-01
12 mary 2010-11-01 55 f apple 2010-07-01
1 tom 2010-02-01 60 m apple 2010-06-01
3 tom 2010-03-01 60 m banana <NA>
6 tom 2010-06-01 60 m apple 2010-06-01
11 tom 2010-08-01 60 m kiwi 2010-08-01
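Answer 1 (score: 4)
Since you have 19 million rows, I think you should try a data.table solution. Here is my attempt.
The result differs slightly from @Dwin's, because I first filter my data to the (start, end) range and then create a new index variable that is the minimum date, within this selected range, for each (names, fruit) pair.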
library(data.table)
DT <- data.table(df2,key=c('names','dates'))
DT[,dates := as.Date(dates)]
DT[between(dates,as.Date("2010-04-01"),as.Date("2010-10-31")),
index := as.character(min(dates))
, by=c('names','fruit')]
## names dates age sex fruit index
## 1: john 2010-07-01 57 m kiwi 2010-07-01
## 2: john 2010-09-01 57 m apple 2010-09-01
## 3: john 2010-11-01 57 m banana NA
## 4: john 2010-12-01 57 m orange NA
## 5: john 2011-01-01 57 m apple NA
## 6: mary 2010-05-01 55 f orange 2010-05-01
## 7: mary 2010-07-01 55 f apple 2010-07-01
## 8: mary 2010-07-01 55 f orange 2010-05-01
## 9: mary 2010-09-01 55 f apple 2010-07-01
## 10: mary 2010-11-01 55 f apple NA
## 11: tom 2010-02-01 60 m apple NA
## 12: tom 2010-03-01 60 m banana NA
## 13: tom 2010-06-01 60 m apple 2010-06-01
## 14: tom 2010-08-01 60 m kiwi 2010-08-01
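The question also asks for a second step: dropping anyone who had an apple in the 180 days before their index date. A rough data.table sketch of that step is given below. It assumes the dummy data from the question, the same date window as above, and strict inequalities on both ends of the look-back period; the object names idx, apples, prevalent and new_users are illustrative, and this is one way to do it rather than the only one.

```r
library(data.table)

# Dummy data from the question (age and sex omitted for brevity).
DT <- data.table(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple")
)
win_start <- as.Date("2010-04-01")  # assumed window bounds
win_end   <- as.Date("2010-10-31")

# Index date: first in-window apple per person (one row per person).
idx <- DT[fruit == "apple" & dates >= win_start & dates <= win_end,
          .(index = min(dates)), by = "names"]

# Attach each person's index date to all of their apple rows.
apples <- merge(DT[fruit == "apple"], idx, by = "names")

# Anyone with an apple in the 180 days before the index date is a prevalent user.
prevalent <- apples[dates < index & dates > index - 180, unique(names)]

# New users: people with an index date who are not prevalent users.
new_users <- DT[names %in% idx$names & !(names %in% prevalent)]
print(new_users)
```

Computing the index date once per person and merging it back avoids any row-by-row loop, which is the main concern at 19 million rows.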