Removing duplicates for only one variable

Time: 2017-04-03 12:59:37

Tags: r dplyr

I have a data frame that I want to condense by removing duplicates, but only for one value of a variable. In the example below, I only want to remove duplicates of user_id where plan_type = subscriber. output below shows how the sample data should be condensed.

I have tried unique(), but it does not work because the same user_id can appear more than once with plan_type = PPU, and those rows should be kept.

> foo
      user_id  plan_type
16435    6264 subscriber
31518   10050 subscriber
31520   10050 subscriber
7576    11174 subscriber
19744   11186 subscriber
19745   11186 subscriber
46108   20348 subscriber
5293    31641 subscriber
5294    31641 subscriber
5295    31641        PPU

> output
      user_id  plan_type
16435    6264 subscriber
31520   10050 subscriber
7576    11174 subscriber
19745   11186 subscriber
46108   20348 subscriber
5294    31641 subscriber
5295    31641        PPU

> dput(foo)
structure(list(user_id = c(6264L, 10050L, 10050L, 11174L, 11186L,
11186L, 20348L, 31641L, 31641L, 31641L), plan_type = c("subscriber",
"subscriber", "subscriber", "subscriber", "subscriber", "subscriber",
"subscriber", "subscriber", "subscriber", "PPU")), .Names = c("user_id",
"plan_type"), row.names = c(16435L, 31518L, 31520L, 7576L, 19744L,
19745L, 46108L, 5293L, 5294L, 5295L), class = "data.frame")

Any suggestions that avoid the multiple steps of subsetting and then re-binding two data frames?

3 Answers:

Answer 0 (score: 6)

You want to keep only the observations that are either not duplicated on user_id or whose plan_type is not "subscriber":

foo[!duplicated(foo$user_id) | foo$plan_type != "subscriber", ]

With dplyr, this would be

library(dplyr)
foo %>% filter(!duplicated(user_id) | plan_type != "subscriber")
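
One detail worth checking against the question: duplicated() flags the later repeats, so the filter above keeps the first row of each duplicate set, while the asker's output keeps the last one (for example 31520 rather than 31518). A minimal sketch of a variant that keeps the last row, assuming that is what is wanted. It also tests the (user_id, plan_type) pair rather than user_id alone, so a subscriber row is not dropped just because a non-subscriber row for the same user comes first. The names dup_pair and result are only illustrative.

```r
# Sample data from the question, rebuilt so the block runs on its own
foo <- data.frame(
  user_id = c(6264L, 10050L, 10050L, 11174L, 11186L, 11186L,
              20348L, 31641L, 31641L, 31641L),
  plan_type = c(rep("subscriber", 9), "PPU"),
  row.names = c(16435L, 31518L, 31520L, 7576L, 19744L, 19745L,
                46108L, 5293L, 5294L, 5295L),
  stringsAsFactors = FALSE
)

# TRUE for rows whose (user_id, plan_type) pair appears again further down;
# fromLast = TRUE means the last occurrence of each pair is the one kept
dup_pair <- duplicated(foo[c("user_id", "plan_type")], fromLast = TRUE)

# Drop a row only if it is a repeated pair AND a subscriber row
result <- foo[!dup_pair | foo$plan_type != "subscriber", ]
print(result)
```

On the sample data this should reproduce the row names shown in output in the question.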

Answer 1 (score: 1)

We can build a logical condition with subset:

subset(foo, (!duplicated(user_id) & plan_type == "subscriber")|
         plan_type %in% setdiff(unique(plan_type), "subscriber"))
#       user_id  plan_type
#16435    6264 subscriber
#31518   10050 subscriber
#7576    11174 subscriber
#19744   11186 subscriber
#46108   20348 subscriber
#5293    31641 subscriber
#5295    31641       PPU

data.table

library(data.table)
rbind(unique(setDT(foo), by = "user_id"), foo[plan_type!= "subscriber"])
#    user_id  plan_type
#1:    6264 subscriber
#2:   10050 subscriber
#3:   11174 subscriber
#4:   11186 subscriber
#5:   20348 subscriber
#6:   31641 subscriber
#7:   31641        PPU
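
A caveat on the data.table call above, as far as I can tell: unique(..., by = "user_id") looks at every row, not only the subscriber ones. It works on the sample because each user's first row is a subscriber row, but if a non-subscriber row came first it would be kept by unique() and then appended again by the second argument. A hedged sketch that restricts the de-duplication to subscriber rows, using a small made-up table (dt and res are hypothetical names, not from the question):

```r
library(data.table)

# Hypothetical data: user 1 has a PPU row BEFORE its subscriber rows
dt <- data.table(
  user_id   = c(1L, 1L, 1L, 2L),
  plan_type = c("PPU", "subscriber", "subscriber", "subscriber")
)

# De-duplicate only the subscriber rows, then append all other rows unchanged
res <- rbind(
  unique(dt[plan_type == "subscriber"], by = "user_id"),
  dt[plan_type != "subscriber"]
)
print(res)
```

This still binds two pieces together, so it does not meet the asker's wish for a single step; it only makes the two-piece approach independent of row order.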

Answer 2 (score: 0)

Taking your input as df

df <-  read.table(text = '  indx    user_id  plan_type
16435    6264 subscriber
31518   10050 subscriber
31520   10050 subscriber
7576    11174 subscriber
19744   11186 subscriber
19745   11186 subscriber
46108   20348 subscriber
5293    31641 subscriber
5294    31641 subscriber
5295    31641        PPU', header = T, stringsAsFactors = F)

You can try:

library(dplyr)
df %>%
  group_by(plan_type, user_id) %>%
  slice(which.max(indx))

which gives:

Source: local data frame [7 x 3]
Groups: plan_type, user_id [7]

   indx user_id  plan_type
  <int>   <int>      <chr>
1  5295   31641        PPU
2 16435    6264 subscriber
3 31520   10050 subscriber
4  7576   11174 subscriber
5 19745   11186 subscriber
6 46108   20348 subscriber
7  5294   31641 subscriber

If you like, you can also include a plan_type == "subscriber" filter, but for the given example it makes no difference.

That can be done like this:

df %>% 
  filter( plan_type == "subscriber") %>%
  group_by(user_id) %>%
  slice(which.max(indx)) %>%
  bind_rows(df %>% filter(plan_type != "subscriber"))
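
If the bind_rows step is unwanted, the same idea can probably be written as a single grouped filter. This is only a sketch under my own assumptions: df2 is made-up data in which one user has two PPU rows, to show that non-subscriber rows are all kept while subscriber rows are reduced to the one with the largest indx per user.

```r
library(dplyr)

# Hypothetical data: user 1 has two subscriber rows and two PPU rows
df2 <- data.frame(
  indx      = 1:5,
  user_id   = c(1L, 1L, 1L, 1L, 2L),
  plan_type = c("subscriber", "subscriber", "PPU", "PPU", "subscriber"),
  stringsAsFactors = FALSE
)

res <- df2 %>%
  group_by(user_id, plan_type) %>%
  # keep every non-subscriber row; within subscriber groups keep the max indx
  filter(plan_type != "subscriber" | indx == max(indx)) %>%
  ungroup()

print(res)
```

Unlike slice(which.max(indx)) on groups of plan_type and user_id, this leaves repeated non-subscriber rows alone, which seems closer to what the question asks for.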