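I have a data frame (shown below) and want to do two things. First, count the number of unique individuals who have both apples and oranges (here that is 2: mary and john).
After that, I want to remove them from my data frame, so that I am left only with the individuals who have apples alone.
The data: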
data
names fruit
7 john apple
13 john orange
14 john apple
2 mary orange
5 mary apple
8 mary orange
10 mary apple
12 mary apple
1 tom apple
6 tom apple
This is what I tried:
toremove<-unique(data[data$fruit=='apple' & data$fruit=='orange',"names"]) ##this part doesn't work, if it had I would have used the below code to remove the names identified
data2<-data[!data$names %in% toremove,]
Really, I would like to use grepl, because my real data is more complicated than fruits. This is what I tried (converting to data.table first):
data1<-data.table(data1)
z<-data1[,ind := grepl('app.*? & orang.*?', fruit), by='names'] ## this works fine when i just use 'app.*?' but collapses when I try to add the & sign, so I'm making an error with the operator. In addition the by='names' doesn't work out for me, which is important. My plan here was to create an indicator (if an individual has an apple and an orange, then they get an indicator==1 and I would then filter them out on the basis of this indicator).
So, to summarize, my problem is identifying the individuals who have both apples and oranges. This seems simple, so feel free to point me to a resource that would teach me this!
Desired output: a data frame that keeps only the individuals who have apples and nothing else.
Answer 0 (score: 6)
If you are only looking for the names that contain nothing but apple, here is a simple data.table approach:
setDT(data)[ , if(all(fruit == "apple")) .SD, by = names]
# names fruit
# 1: tom apple
# 2: tom apple
To count the names that have both "apple" and "orange", you could do something like:
data[, any(fruit == "apple") & any(fruit == "orange"), by = names][, sum(V1)]
## [1] 2
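Since the question asks for grepl, the same grouped test can also be written with pattern matching. The following is a minimal sketch, not part of the original answer: the patterns "^app" and "^orang" and the names has_both, n_both and apples_only are illustrative assumptions. grepl tests each row on its own, so the two patterns are combined per group with any() rather than inside one regular expression.

```r
library(data.table)

# Rebuild the example data from the question
data <- data.table(
  names = c("john", "john", "john", "mary", "mary", "mary", "mary", "mary", "tom", "tom"),
  fruit = c("apple", "orange", "apple", "orange", "apple", "orange", "apple", "apple", "apple", "apple")
)

# One row per name: TRUE when the group contains a match for both patterns
# (patterns are hypothetical stand-ins for more complex real-world values)
has_both <- data[, .(both = any(grepl("^app", fruit)) && any(grepl("^orang", fruit))),
                 by = names]

# Number of individuals who have both fruits
n_both <- has_both[, sum(both)]

# Drop those individuals from the data
apples_only <- data[!names %in% has_both[both == TRUE, names]]
```

On the question's data, n_both is 2 and apples_only holds only tom's two rows.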
Finally, if you only want to find users with a single unique fruit, you could try conditioning on length(unique()) (or on uniqueN, which was in the devel version on GH at the time of writing):
data[, if(uniqueN(fruit) < 2L) .SD, by = names]
# names fruit
# 1: tom apple
# 2: tom apple
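The length(unique()) form mentioned above can be used where uniqueN is not available. A minimal sketch on the question's data; the result name single_fruit is an illustrative assumption.

```r
library(data.table)

# Rebuild the example data from the question
data <- data.table(
  names = c("john", "john", "john", "mary", "mary", "mary", "mary", "mary", "tom", "tom"),
  fruit = c("apple", "orange", "apple", "orange", "apple", "orange", "apple", "apple", "apple", "apple")
)

# Keep only the groups whose fruit column has a single distinct value
single_fruit <- data[, if (length(unique(fruit)) < 2L) .SD, by = names]
```

Note that this condition keeps any name with one distinct fruit, so a name that had only oranges would be kept as well; the all(fruit == "apple") version is the stricter test if only apples are wanted.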
Answer 1 (score: 0)
I used the dplyr package to flag users who have oranges and users who have both fruits. (I added a row at the end to get a case with oranges only.)
data =
read.table(text="
names fruit
7 john apple
13 john orange
14 john apple
2 mary orange
5 mary apple
8 mary orange
10 mary apple
12 mary apple
1 tom apple
6 tom apple
21 kathy orange", header=T)
# names fruit
# 7 john apple
# 13 john orange
# 14 john apple
# 2 mary orange
# 5 mary apple
# 8 mary orange
# 10 mary apple
# 12 mary apple
# 1 tom apple
# 6 tom apple
# 21 kathy orange
library(dplyr)
data %>%
  group_by(names) %>%                             # for each user name
  mutate(N_dist = n_distinct(fruit),              # count the distinct fruits
         N_oranges = sum(fruit == "orange")) %>%  # count the oranges
  filter(N_oranges == 0 & N_dist < 2) %>%         # keep users with a single fruit type and no oranges
  select(names, fruit)
# names fruit
# 1 tom apple
# 2 tom apple
Note that before the filter is applied, your dataset looks like this:
# names fruit N_dist N_oranges
# 1 john apple 2 1
# 2 john orange 2 1
# 3 john apple 2 1
# 4 mary orange 2 2
# 5 mary apple 2 2
# 6 mary orange 2 2
# 7 mary apple 2 2
# 8 mary apple 2 2
# 9 tom apple 1 0
# 10 tom apple 1 0
# 11 kathy orange 1 1
From this you can get the unique names of the users who have both fruits, or of the users who have oranges.
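One way to pull those names out is to collapse the table to one row per user first. This is a minimal sketch rather than part of the original answer: the names per_user, both_fruits and with_oranges are illustrative, and reading N_dist == 2 as "has both fruits" assumes the data contains only the two fruit types shown.

```r
library(dplyr)

# Rebuild the example data, including the extra orange-only row for kathy
data <- data.frame(
  names = c("john", "john", "john", "mary", "mary", "mary", "mary", "mary",
            "tom", "tom", "kathy"),
  fruit = c("apple", "orange", "apple", "orange", "apple", "orange", "apple",
            "apple", "apple", "apple", "orange"),
  stringsAsFactors = FALSE
)

# One row per user with the same two counts as in the table above
per_user <- data %>%
  group_by(names) %>%
  summarise(N_dist = n_distinct(fruit),
            N_oranges = sum(fruit == "orange"))

# Users who have both fruits (assumes only two fruit types exist)
both_fruits <- per_user %>% filter(N_dist == 2) %>% pull(names)

# Users who have at least one orange
with_oranges <- per_user %>% filter(N_oranges > 0) %>% pull(names)
```

Here both_fruits contains john and mary, and with_oranges contains john, kathy and mary; length(both_fruits) gives the count asked for in the question.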