How can I drop observations with too few values in R?

Time: 2018-04-29 16:13:01

Tags: r database

I want to set up a simple example. It is probably trivial, but I don't know how to write the code for it.

I have a panel dataset with two variables, date and company, along with some other variables:

date <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6,6,6)
company <-c("a","b","c","d","e","a","b","c","d","a","b","a","b","c","a","b","c","a","b","c","d","e")

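Not every company trades every day, so I only want to keep the data for companies that have traded more than 4 times. In this example I have 6 days and 5 companies. Companies "e" and "d" are the ones that should be removed.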
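The expectation above can be checked by counting how often each company appears. A minimal sketch using base R's table() on the same company vector (the name counts is only illustrative):

```r
# Same company vector as in the question
company <- c("a","b","c","d","e","a","b","c","d","a","b","a","b","c","a",
             "b","c","a","b","c","d","e")

# Number of rows (trading days) per company
counts <- table(company)
print(counts)   # a: 6, b: 6, c: 5, d: 3, e: 2

# Companies that do NOT exceed 4 trades, i.e. the ones to drop
names(counts)[counts <= 4]   # "d" "e"
```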

1 answer:

Answer 0 (score: 2)

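One option is to use dplyr::filter together with group_by. n() gives the number of rows in the current group, so after group_by(company) it returns the number of times each company traded.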

# Example data
date <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6,6,6)
company <- c("a","b","c","d","e","a","b","c","d","a","b","a","b","c","a",
             "b","c","a","b","c","d","e")
df <- data.frame(date, company)

library(dplyr)

# Keep only the rows of companies that traded more than 4 times
df %>%
  group_by(company) %>%
  filter(n() > 4)

# Result: e and d do not appear because their counts (n()) are not greater than 4
# # A tibble: 17 x 2
# # Groups: company [3]
#     date company
#    <dbl> <fctr>
#  1  1.00 a
#  2  1.00 b
#  3  1.00 c
#  4  2.00 a
#  5  2.00 b
#  6  2.00 c
#  7  3.00 a
#  8  3.00 b
#  9  4.00 a
# 10  4.00 b
# 11  4.00 c
# 12  5.00 a
# 13  5.00 b
# 14  5.00 c
# 15  6.00 a
# 16  6.00 b
# 17  6.00 c
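
If adding dplyr is not desirable, the same filter can be sketched in base R. This is an alternative to the approach above, not part of the original answer; it assumes the same df and uses ave() to attach each company's row count to every one of its rows (n_trades and kept are illustrative names):

```r
# Same example data as in the answer
date <- c(1,1,1,1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6,6,6)
company <- c("a","b","c","d","e","a","b","c","d","a","b","a","b","c","a",
             "b","c","a","b","c","d","e")
df <- data.frame(date, company)

# ave() applies length() within each company and repeats the result
# on every row belonging to that company
n_trades <- ave(seq_along(df$company), df$company, FUN = length)

# Keep only rows whose company traded more than 4 times
kept <- df[n_trades > 4, ]

nrow(kept)                                 # 17
sort(unique(as.character(kept$company)))   # "a" "b" "c"
```

Unlike the dplyr version, the result here is an ordinary data frame with no grouping attached, and the original row names are preserved.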