Select the first unique observation where a column value = x in R

Time: 2013-06-30 02:00:58

Tags: r binary unique

I want to identify the unique people who got an apple within a specified time frame. I do this by creating a binary indicator "apples" as follows.

names<-c("tom", "mary", "tom", "john", "mary", "tom", "john", "mary", "john", "mary", "tom", "mary", "john", "john")
dates<-as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01", "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01", "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01", "2010-12-01", "2011-01-01"))
fruit<-as.character(c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple", "orange", "banana", "apple", "kiwi", "apple", "orange", "apple"))
age<-as.numeric(c(60,55,60,57,55,60,57,55,57,55,60,55, 57,57))
sex<-as.character(c("m","f","m","m","f","m","m", "f","m","f","m","f","m", "m"))
df<-data.frame(names,dates, age, sex, fruit)
df


df$apples<-ifelse(df$fruit=='apple' & df$dates>="2010-04-01" & df$dates<"2010-10-01",1,0)
df

 names      dates age sex  fruit apples
1    tom 2010-02-01  60   m  apple      0
2   mary 2010-05-01  55   f orange      0
3    tom 2010-03-01  60   m banana      0
4   john 2010-07-01  57   m   kiwi      0
5   mary 2010-07-01  55   f  apple      1
6    tom 2010-06-01  60   m  apple      1
7   john 2010-09-01  57   m  apple      1
8   mary 2010-07-01  55   f orange      0
9   john 2010-11-01  57   m banana      0
10  mary 2010-09-01  55   f  apple      1
11   tom 2010-08-01  60   m   kiwi      0
12  mary 2010-11-01  55   f  apple      0
13  john 2010-12-01  57   m orange      0
14  john 2011-01-01  57   m  apple      0

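My problem is that Mary appears twice. I only want the first date on which each person got an apple within the specified time frame (and everyone will have a first date in the real data). I want a second column called "apples1" that flags, for each person, the initial date within the defined time frame on which they got an apple.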

Desired output:

 names      dates age sex  fruit apples apples1
1    tom 2010-02-01  60   m  apple      0       0
2   mary 2010-05-01  55   f orange      0       0
3    tom 2010-03-01  60   m banana      0       0
4   john 2010-07-01  57   m   kiwi      0       0
5   mary 2010-07-01  55   f  apple      1       1
6    tom 2010-06-01  60   m  apple      1       1
7   john 2010-09-01  57   m  apple      1       1
8   mary 2010-07-01  55   f orange      0       0
9   john 2010-11-01  57   m banana      0       0
10  mary 2010-09-01  55   f  apple      1       0
11   tom 2010-08-01  60   m   kiwi      0       0
12  mary 2010-11-01  55   f  apple      0       0
13  john 2010-12-01  57   m orange      0       0
14  john 2011-01-01  57   m  apple      0       0

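I've been searching, and the closest I found is this: Select only the first rows for each unique value of a column in R. But that doesn't handle the uniqueness problem. I also came across !duplicated, but I don't want to remove Mary's data, because I need her dates to continue following up on her. I may be missing something fundamental here; apologies in advance.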

2 Answers:

Answer 0: (score: 1)

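library(plyr)
# Sort by date so the earliest rows come first within each person
df <- df[order(df$dates), ]
# Flag only the first row per person where the in-window indicator "apples" is 1.
# (Testing fruit == "apple" alone would flag tom's 2010-02-01 apple, which is outside the window.)
ddply(df, "names", transform, 
  apples1 = as.numeric(apples == 1 & !duplicated(apples))
)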

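Note: I'm assuming that ddply preserves the row order of the data frame within each piece when it splits on the splitting variable. That has been my experience. You could modify this solution slightly by replacing transform with an inline function that does the sorting itself, but I don't think that's necessary.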
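For comparison, here is a minimal base R sketch of the same idea that does not rely on ddply's ordering and leaves the row order of df untouched. It rebuilds only the three columns it needs from the question's data. The helper names in_window, ord, hit and first_hit are my own, not from the question, so treat this as one possible approach rather than the definitive one.

```r
names <- c("tom", "mary", "tom", "john", "mary", "tom", "john",
           "mary", "john", "mary", "tom", "mary", "john", "john")
dates <- as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                   "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                   "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                   "2010-12-01", "2011-01-01"))
fruit <- c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
           "orange", "banana", "apple", "kiwi", "apple", "orange", "apple")
df <- data.frame(names, dates, fruit, stringsAsFactors = FALSE)

# In-window apple indicator, same condition as in the question.
in_window <- df$fruit == "apple" &
  df$dates >= as.Date("2010-04-01") &
  df$dates <  as.Date("2010-10-01")
df$apples <- as.numeric(in_window)

# Row indices sorted by date, restricted to in-window apple rows.
ord <- order(df$dates)
hit <- ord[in_window[ord]]

# Keep only the first such row for each person.
first_hit <- hit[!duplicated(df$names[hit])]

# Flag those rows; every other row stays 0.
df$apples1 <- 0
df$apples1[first_hit] <- 1
df
```

Because the flag is written back by row index, all of Mary's rows are kept for follow-up, and if a person has two qualifying rows on the same date only one of them is flagged.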

Answer 1: (score: 1)

Here is a data.table solution. I create both columns at the same time.

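library(data.table)

DT <- data.table(df)
# Sort by person, then by date, so the first matching row per person is the earliest
setkeyv(DT, c("names", "dates"))
# On the in-window apple rows only, set apples to 1 and
# apples1 to 1 for the first such row per person (0 for later ones).
# Rows outside the subset are not assigned.
DT[ fruit == "apple" & 
    dates >= "2010-04-01" & 
    dates <  "2010-10-01",
    c('apples','apples1') :=
         list(1,
              ifelse(!duplicated(names), 1, 0))
  ]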

   names      dates age sex  fruit apples apples1
 1:  john 2010-07-01  57   m   kiwi     NA      NA
 2:  john 2010-09-01  57   m  apple      1       1
 3:  john 2010-11-01  57   m banana     NA      NA
 4:  john 2010-12-01  57   m orange     NA      NA
 5:  john 2011-01-01  57   m  apple     NA      NA
 6:  mary 2010-05-01  55   f orange     NA      NA
 7:  mary 2010-07-01  55   f  apple      1       1
 8:  mary 2010-07-01  55   f orange     NA      NA
 9:  mary 2010-09-01  55   f  apple      1       0
10:  mary 2010-11-01  55   f  apple     NA      NA
11:   tom 2010-02-01  60   m  apple     NA      NA
12:   tom 2010-03-01  60   m banana     NA      NA
13:   tom 2010-06-01  60   m  apple      1       1
14:   tom 2010-08-01  60   m   kiwi     NA      NA
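
If zeros are preferred over NA, as in the desired output, one possible variant computes both columns for every row and then groups by person. This is only a sketch, assuming data.table is installed; it rebuilds the three columns it needs from the question's data instead of reusing df.

```r
library(data.table)

DT <- data.table(
  names = c("tom", "mary", "tom", "john", "mary", "tom", "john",
            "mary", "john", "mary", "tom", "mary", "john", "john"),
  dates = as.Date(c("2010-02-01", "2010-05-01", "2010-03-01", "2010-07-01",
                    "2010-07-01", "2010-06-01", "2010-09-01", "2010-07-01",
                    "2010-11-01", "2010-09-01", "2010-08-01", "2010-11-01",
                    "2010-12-01", "2011-01-01")),
  fruit = c("apple", "orange", "banana", "kiwi", "apple", "apple", "apple",
            "orange", "banana", "apple", "kiwi", "apple", "orange", "apple")
)

# Sort by person, then by date.
setkeyv(DT, c("names", "dates"))

# 1 for in-window apple rows, 0 for everything else (no NA).
DT[, apples := as.numeric(fruit == "apple" &
                          dates >= as.Date("2010-04-01") &
                          dates <  as.Date("2010-10-01"))]

# Within each person, flag only the first row where apples is 1.
DT[, apples1 := as.numeric(apples == 1 & !duplicated(apples)), by = "names"]
DT
```

The rows come back sorted by name and date, as in the output above, but with 0 in place of NA.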