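I have a data frame that I would like to condense by removing duplicates, but only for a certain variable. In the example below, I only want to remove duplicates of user_id when plan_type = subscriber. output below shows what the sample data should be condensed to.
I have tried unique(), but it does not work, because the same user_id may appear multiple times with plan_type = PPU.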
> foo
      user_id  plan_type
16435    6264 subscriber
31518   10050 subscriber
31520   10050 subscriber
7576    11174 subscriber
19744   11186 subscriber
19745   11186 subscriber
46108   20348 subscriber
5293    31641 subscriber
5294    31641 subscriber
5295    31641        PPU
> output
      user_id  plan_type
16435    6264 subscriber
31520   10050 subscriber
7576    11174 subscriber
19745   11186 subscriber
46108   20348 subscriber
5294    31641 subscriber
5295    31641        PPU
> dput(foo)
structure(list(user_id = c(6264L, 10050L, 10050L, 11174L, 11186L,
11186L, 20348L, 31641L, 31641L, 31641L), plan_type = c("subscriber",
"subscriber", "subscriber", "subscriber", "subscriber", "subscriber",
"subscriber", "subscriber", "subscriber", "PPU")), .Names = c("user_id",
"plan_type"), row.names = c(16435L, 31518L, 31520L, 7576L, 19744L,
19745L, 46108L, 5293L, 5294L, 5295L), class = "data.frame")
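Rows with plan_type = PPU should be retained.
Are there any suggestions that do not involve multiple steps of subsetting and then re-binding the two data frames?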
Answer 0 (score: 6)
You want to keep only the observations where user_id is not duplicated or plan_type is not "subscriber":
foo[!duplicated(foo$user_id) | foo$plan_type != "subscriber", ]
With dplyr, this would be:
library(dplyr)
foo %>% filter(!duplicated(user_id) | plan_type != "subscriber")
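One detail worth checking: duplicated() keeps the first occurrence by default, whereas the desired output in the question retains the last subscriber row per user_id (row names 31520, 19745, 5294). If that matters, a minimal base R sketch along the following lines may help. It assumes that "duplicate" means the same user_id and plan_type pair, so a subscriber row that follows a PPU row for the same user is not dropped by accident. The names dup_pair and result are illustrative only.

```r
# Sample data from the question (row names kept so results can be compared).
foo <- data.frame(
  user_id = c(6264L, 10050L, 10050L, 11174L, 11186L,
              11186L, 20348L, 31641L, 31641L, 31641L),
  plan_type = c(rep("subscriber", 9), "PPU"),
  row.names = c(16435L, 31518L, 31520L, 7576L, 19744L,
                19745L, 46108L, 5293L, 5294L, 5295L),
  stringsAsFactors = FALSE
)

# Assumption: a row counts as a duplicate only if a LATER row has the
# same user_id AND the same plan_type (fromLast = TRUE scans from the end).
dup_pair <- duplicated(foo[c("user_id", "plan_type")], fromLast = TRUE)

# Keep every non-subscriber row, plus the last subscriber row per user_id.
result <- foo[!dup_pair | foo$plan_type != "subscriber", ]
result
```

Dropping fromLast = TRUE reverts to keeping the first occurrence, as in the code above.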
Answer 1 (score: 1)
We can build the logical condition inside subset():
subset(foo, (!duplicated(user_id) & plan_type == "subscriber")|
plan_type %in% setdiff(unique(plan_type), "subscriber"))
# user_id plan_type
#16435 6264 subscriber
#31518 10050 subscriber
#7576 11174 subscriber
#19744 11186 subscriber
#46108 20348 subscriber
#5293 31641 subscriber
#5295 31641 PPU
Or with data.table:
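library(data.table)
setDT(foo)  # convert to a data.table by reference
# de-duplicate only the subscriber rows, then append all other rows unchanged
rbind(unique(foo[plan_type == "subscriber"], by = "user_id"), foo[plan_type != "subscriber"])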
# user_id plan_type
#1: 6264 subscriber
#2: 10050 subscriber
#3: 11174 subscriber
#4: 11186 subscriber
#5: 20348 subscriber
#6: 31641 subscriber
#7: 31641 PPU
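Since the question asks to avoid subsetting and re-binding, the same filter can probably be written as a single data.table expression. The sketch below is one way to do it, assuming the data.table package is installed and that a duplicate is defined by the user_id and plan_type pair. The names dt and res are illustrative and not from the original answer.

```r
library(data.table)

# Sample data from the question, built directly as a data.table.
dt <- data.table(
  user_id = c(6264L, 10050L, 10050L, 11174L, 11186L,
              11186L, 20348L, 31641L, 31641L, 31641L),
  plan_type = c(rep("subscriber", 9), "PPU")
)

# Keep all non-subscriber rows, plus the first row of each
# (user_id, plan_type) pair; the original row order is preserved.
res <- dt[plan_type != "subscriber" |
            !duplicated(dt, by = c("user_id", "plan_type"))]
res
```

Passing fromLast = TRUE to duplicated() would keep the last row of each pair instead of the first.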
Answer 2 (score: 0)
Taking your input as df:
df <- read.table(text = ' indx user_id plan_type
16435 6264 subscriber
31518 10050 subscriber
31520 10050 subscriber
7576 11174 subscriber
19744 11186 subscriber
19745 11186 subscriber
46108 20348 subscriber
5293 31641 subscriber
5294 31641 subscriber
5295 31641 PPU', header = TRUE, stringsAsFactors = FALSE)
You can try:
library(dplyr)
# keep the row with the largest indx within each (plan_type, user_id) group
df %>%
group_by(plan_type, user_id) %>%
slice(which.max(indx))
which gives:
Source: local data frame [7 x 3]
Groups: plan_type, user_id [7]
indx user_id plan_type
<int> <int> <chr>
1 5295 31641 PPU
2 16435 6264 subscriber
3 31520 10050 subscriber
4 7576 11174 subscriber
5 19745 11186 subscriber
6 46108 20348 subscriber
7 5294 31641 subscriber
If you like, you can also include a plan_type == "subscriber" filter, but with the given example it makes no difference to the result.
That can be done like this:
df %>%
filter( plan_type == "subscriber") %>%
group_by(user_id) %>%
slice(which.max(indx)) %>%
bind_rows(df %>% filter(plan_type != "subscriber"))
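The two pipelines above agree on the sample data, but they may diverge once a non-subscriber plan repeats for the same user. The sketch below uses a small hypothetical data frame, df2 (not from the question), in which user 31641 has two PPU rows, to illustrate the difference.

```r
library(dplyr)

# Hypothetical data: one user with two subscriber rows and two PPU rows.
df2 <- data.frame(
  indx = c(5293L, 5294L, 5295L, 5296L),
  user_id = c(31641L, 31641L, 31641L, 31641L),
  plan_type = c("subscriber", "subscriber", "PPU", "PPU"),
  stringsAsFactors = FALSE
)

# Grouping by both columns collapses the repeated PPU rows as well.
grouped <- df2 %>%
  group_by(plan_type, user_id) %>%
  slice(which.max(indx)) %>%
  ungroup()

# Filtering first de-duplicates subscribers only; PPU rows pass through untouched.
filtered <- df2 %>%
  filter(plan_type == "subscriber") %>%
  group_by(user_id) %>%
  slice(which.max(indx)) %>%
  ungroup() %>%
  bind_rows(df2 %>% filter(plan_type != "subscriber"))

nrow(grouped)   # one row per (plan_type, user_id) pair
nrow(filtered)  # one subscriber row plus every PPU row
```

If repeated PPU rows must be retained, as the question states, the filtered variant appears to be the safer of the two.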