"New user" design in R

Date: 2013-07-14 00:16:13

Tags: r date match population

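I want to build a cohort of new drug users (Ray 2003). My original dataset has about 19 million rows, so loops have proven inefficient. Here is a dummy dataset (using fruits instead of drugs):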

    df2

   names      dates age sex  fruit
1    tom 2010-02-01  60   m  apple
2   mary 2010-05-01  55   f orange
3    tom 2010-03-01  60   m banana
4   john 2010-07-01  57   m   kiwi
5   mary 2010-07-01  55   f  apple
6    tom 2010-06-01  60   m  apple
7   john 2010-09-01  57   m  apple
8   mary 2010-07-01  55   f orange
9   john 2010-11-01  57   m banana
10  mary 2010-09-01  55   f  apple
11   tom 2010-08-01  60   m   kiwi
12  mary 2010-11-01  55   f  apple
13  john 2010-12-01  57   m orange
14  john 2011-01-01  57   m  apple

I have identified the people who were prescribed apple between 04-2010 and 10-2010:

temp2

  names      dates age sex fruit
6   tom 2010-06-01  60   m apple
5  mary 2010-07-01  55   f apple
7  john 2010-09-01  57   m apple

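I want to create a new column called "index" in the original df, holding the first date on which a person was prescribed the drug within the defined date range. This is what I have tried in order to get the dates from temp2 into df2$index: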

df2$index<-temp2$dates    
df2$index<-df2$dates == temp2$dates
df2$index<-df2$dates %in% temp2$dates
df2$index<-ifelse(as.Date(df$dates)==as.Date(temp2$dates), as.Date(temp2$dates),NA)

None of these work. This is the desired output:

    df2

   names      dates age sex  fruit      index
1    tom 2010-02-01  60   m  apple       <NA>
2   mary 2010-05-01  55   f orange       <NA>
3    tom 2010-03-01  60   m banana       <NA>
4   john 2010-07-01  57   m   kiwi       <NA>
5   mary 2010-07-01  55   f  apple 2010-07-01
6    tom 2010-06-01  60   m  apple 2010-06-01
7   john 2010-09-01  57   m  apple 2010-09-01
8   mary 2010-07-01  55   f orange       <NA>
9   john 2010-11-01  57   m banana       <NA>
10  mary 2010-09-01  55   f  apple       <NA>
11   tom 2010-08-01  60   m   kiwi       <NA>
12  mary 2010-11-01  55   f  apple       <NA>
13  john 2010-12-01  57   m orange       <NA>
14  john 2011-01-01  57   m  apple       <NA>

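Once I have the desired output, I want to look back from the index date and check whether the person had an apple in the preceding 180 days. If they did not have an apple, I want to keep them. If they did have an apple (e.g. tom), I want to discard them. This is the code I tried on the desired output: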

df4<-df2[df2$fruit!='apple' & df2$index-180,]
df4<-df2[df2$fruit!='apple' & df2$dates<=df2$index-180,] ##neither work for me

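I would appreciate any guidance on these problems, or even pointers to what I should read to learn how to do this. Perhaps my logic is flawed and my approach will not work; if so, please tell me! Thanks in advance.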

Here is my df:

names<-c("tom", "mary", "tom", "john", "mary",
 "tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", 
"2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01",
 "2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01", 
"2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi",
 "apple", "apple", "apple", "orange", "banana", "apple",
 "kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m",
 "f","m","f","m","f","m", "m"))
df2<-data.frame(names,dates, age, sex, fruit)
df2

Here is temp2:

data1<-df2[df2$fruit=="apple"& (df2$dates >= "2010-04-01" & df2$dates<  "2010-10-01"), ]
index <- with(data1, order(dates))
temp<-data1[index, ] 
dup<-duplicated(temp$names)
temp1<-cbind(temp,dup)
temp2<-temp1[temp1$dup!=TRUE,]
temp2$dup<-NULL

df2 <- df2[with(df2, order(names, dates)), ]
## DWin's code: index date for each (names, fruit) group within the window
df2$first.date <- ave(df2$dates, df2$names, df2$fruit,
       FUN=function(dt) dt[dt <= "2010-10-31" & dt >= "2010-04-01"][1])

## TRUE where an apple falls in the 180 days before the index date (e.g. tom is not a new user)
df2$x <- df2$fruit == 'apple' & df2$dates > df2$first.date - 180 & df2$dates < df2$first.date
## names with at least one TRUE; which() skips NA comparisons
ids <- with(df2, unique(names[which(x)]))
## drop every name that has at least one TRUE
new_users <- subset(df2, !names %in% ids)

2 answers:

Answer 0 (score: 4)

Sort by name and date:

df <- df[with(df, order(names, dates)), ]

Then pick the first date within each name:

df$first.date <- ave(df$dates, df$names, FUN = function(d) d[1])
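As a small illustration of what ave() does in this step, here is a sketch on toy vectors (the names `g`, `x` and `first_in_group` are made up for the example and are not part of the question's data). ave() splits `x` by the grouping variable, applies `FUN` to each piece, and writes the result back into the original positions, recycling a length-one result across the group. Its signature is `ave(x, ..., FUN = mean)`, so every unnamed argument after `x` is treated as a grouping variable; that is why the selector is wrapped in an anonymous function here.

```r
# Toy vectors (hypothetical, not from the question's data)
g <- c("a", "b", "a", "b", "a")
x <- c(10, 20, 30, 40, 50)

# For each group in g, take the first element of x and
# broadcast it to every position belonging to that group.
first_in_group <- ave(x, g, FUN = function(v) v[1])
first_in_group
```

Because the first element is taken in the order the rows appear, the data has to be sorted by date within each name beforehand, as in the order() call above.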

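Now that you have seen the power of the fully operational Death Star, er, ave function, you are ready to pick out the first date within each combination of 'names' and 'fruit' that falls inside the date range: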

> df$first.date <- ave(df$dates, df$names, df$fruit, 
         FUN=function(dt) dt[dt <="2010-10-31" & dt>="2010-04-01"][1] )
> df
   names      dates age sex  fruit first.date
4   john 2010-07-01  57   m   kiwi 2010-07-01
7   john 2010-09-01  57   m  apple 2010-09-01
9   john 2010-11-01  57   m banana       <NA>
13  john 2010-12-01  57   m orange       <NA>
14  john 2011-01-01  57   m  apple 2010-09-01
2   mary 2010-05-01  55   f orange 2010-05-01
5   mary 2010-07-01  55   f  apple 2010-07-01
8   mary 2010-07-01  55   f orange 2010-05-01
10  mary 2010-09-01  55   f  apple 2010-07-01
12  mary 2010-11-01  55   f  apple 2010-07-01
1    tom 2010-02-01  60   m  apple 2010-06-01
3    tom 2010-03-01  60   m banana       <NA>
6    tom 2010-06-01  60   m  apple 2010-06-01
11   tom 2010-08-01  60   m   kiwi 2010-08-01
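The table above repeats first.date on every row of a (names, fruit) group. If the date should appear only on the index row itself, as in the question's desired output, one possible base R sketch is shown below. It is an illustration, not part of the original answer: the reduced data frame and the helper names `in_win`, `cand`, `first` and `hit` are assumptions made for the example, and the window uses the question's bounds (2010-04-01 inclusive, 2010-10-01 exclusive).

```r
# Reduced copy of the question's example data (names, dates, fruit only)
df <- data.frame(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"),
  stringsAsFactors = FALSE
)

start <- as.Date("2010-04-01")  # inclusive lower bound
end   <- as.Date("2010-10-01")  # exclusive upper bound

# Apple rows that fall inside the window
in_win <- df$fruit == "apple" & df$dates >= start & df$dates < end

# Earliest in-window apple row per person
cand  <- df[in_win, ]
cand  <- cand[order(cand$names, cand$dates), ]
first <- cand[!duplicated(cand$names), c("names", "dates")]

# Fill index only where the row is that person's first in-window apple;
# all other rows stay NA. Ties on the same date would all be flagged.
df$index <- as.Date(NA)
hit <- in_win & df$dates == first$dates[match(df$names, first$names)]
df$index[hit] <- df$dates[hit]
df
```

match() does the lookup in one vectorised pass, which avoids an explicit loop over people.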

Answer 1 (score: 4)

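Since you have 19 million rows, I think you should try a data.table solution. Here is my attempt. The result differs slightly from @Dwin's because I first filter the data to the (start, end) range and then create a new index variable, which is the minimum date within that range for each (names, fruit) group.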

library(data.table)
DT <- data.table(df2,key=c('names','dates'))
DT[,dates := as.Date(dates)]
DT[between(dates, as.Date("2010-04-01"), as.Date("2010-10-31")),
   index := as.character(min(dates)),
   by = c('names', 'fruit')]
DT
##     names      dates age sex  fruit      index
##  1:  john 2010-07-01  57   m   kiwi 2010-07-01
##  2:  john 2010-09-01  57   m  apple 2010-09-01
##  3:  john 2010-11-01  57   m banana         NA
##  4:  john 2010-12-01  57   m orange         NA
##  5:  john 2011-01-01  57   m  apple         NA
##  6:  mary 2010-05-01  55   f orange 2010-05-01
##  7:  mary 2010-07-01  55   f  apple 2010-07-01
##  8:  mary 2010-07-01  55   f orange 2010-05-01
##  9:  mary 2010-09-01  55   f  apple 2010-07-01
## 10:  mary 2010-11-01  55   f  apple         NA
## 11:   tom 2010-02-01  60   m  apple         NA
## 12:   tom 2010-03-01  60   m banana         NA
## 13:   tom 2010-06-01  60   m  apple 2010-06-01
## 14:   tom 2010-08-01  60   m   kiwi 2010-08-01
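The question's second step, dropping anyone who had an apple in the 180 days before their index date, is not covered above. A minimal data.table sketch of that step follows, under stated assumptions: the reduced table and the names `idx`, `apples`, `prevalent`, `new_users` and `washout` are invented for the illustration, the window is the inclusive range used in the between() call above, and the look-back uses strict inequalities like the question's own code.

```r
library(data.table)

# Reduced copy of the question's example data (names, dates, fruit only)
DT <- data.table(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple")
)

start   <- as.Date("2010-04-01")
end     <- as.Date("2010-10-31")  # inclusive, as in the between() call above
washout <- 180                    # look-back length in days, from the question

# Index date: first in-window apple prescription per person
idx <- DT[fruit == "apple" & between(dates, start, end),
          .(index = min(dates)), by = "names"]

# Attach each person's index date to all of their apple rows
apples <- merge(DT[fruit == "apple"], idx, by = "names")

# Anyone with an apple in the 180 days before the index date is not a new user
prevalent <- apples[dates < index & dates > index - washout, unique(names)]

# One row per remaining person, with their index date
new_users <- idx[!names %in% prevalent]
new_users
```

On the example data this flags tom, whose 2010-02-01 apple falls inside the look-back from his 2010-06-01 index date, and keeps mary and john.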