Question

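I have a data frame with 10 columns and 100,000 rows. Each row is an observation, and the columns hold data about that observation. One of the columns gives the observation date in Julian days (i.e. the day of the year, for example day 34). I want to reduce my data set so that I keep only the first 10% of observations PER species per year. That is, for species 1 in 1901, I want the mean day of occurrence based on the first 10% of observations.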

An example of what I have (note that id = species, but as a number, i.e. blue = 1):

date=c(3,84,98,100,34,76,86...)
species=c(blue,purple,grey,purple,green,pink,pink,white...)
id=c(1,2,3,2,4,5,5,6...)
year=c(1901,2000,1901,1996,1901,2000,1986...)  
habitat=c(forest,plain,mountain...)

etc.

What I want is:

date=c(3,84,76,86...)
species=c(purple,pink,pink,white...)
id=c(2,5,5,6...)
year=c(1901,2000,2000,1986...)
habitat=c(forest,plain,mountain...)
new=c(3,84,79,86...)

Answer 1

Assume the data set dd is defined as follows:

set.seed(123)
n <- 100000
dd <- data.frame(year = sample(1901:2000, n, replace = TRUE), 
                 date = sample(0:364, n, replace = TRUE),
                 species = sample(1:5, n, replace = TRUE))

1) base Aggregate dd using the function defined below. No packages are used:

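# Mean of the first 10% of a group's dates, taken in row order
avg10 <- function(date) {
  ok <- seq_along(date) <= length(date) / 10   # TRUE for the first tenth of the rows
  if (any(ok)) mean(date[ok]) else NA          # NA when the group has fewer than 10 rows
}
aggregate(date ~ species + year, dd, avg10)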
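Note that avg10 takes the first tenth of each group's rows in the order they appear in dd, so it gives the earliest dates only if the data are already sorted by date. Below is a minimal sketch of a variant that sorts inside the function, assuming "first 10%" means the earliest dates within each species/year group. The name avg10_sorted and the sample vector are hypothetical and not part of the answer.

```r
# Hypothetical variant of avg10: sort first, then average the earliest 10%.
avg10_sorted <- function(date) {
  k <- floor(length(date) / 10)        # number of observations in the first 10%
  if (k < 1) return(NA_real_)          # fewer than 10 observations in the group
  mean(sort(date)[seq_len(k)])         # mean of the k earliest dates
}

# Hand-made group of 20 dates; the two earliest are 3 and 10.
x <- c(50, 3, 200, 120, 10, 90, 300, 45, 60, 75,
       130, 140, 150, 160, 170, 180, 190, 210, 220, 230)
avg10_sorted(x)   # 6.5
```

It could be passed to aggregate in place of avg10, e.g. aggregate(date ~ species + year, dd, avg10_sorted).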

2) data.table Here is a data.table solution:

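library(data.table)
# seq_len(.N) is the row position within each group (.I would index rows of the whole table)
data.table(dd)[, 
  {ok <- seq_len(.N) <= .10 * .N; if (any(ok)) mean(date[ok]) else NA_real_}, by = "species,year"]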

Note: if you do not want NA, use this in place of the if statement in either solution above to return the first point instead:

  if (any(ok)) mean(date[ok]) else date[1]
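To make the difference concrete, here is a small sketch comparing the two fallbacks on a group with fewer than 10 observations. avg10 is repeated from the base solution; avg10_first is a hypothetical name for the variant with the replacement if statement, and the three dates are made up.

```r
# avg10 as in the base solution: NA when the group has fewer than 10 rows
avg10 <- function(date) {
  ok <- seq_along(date) <= length(date) / 10
  if (any(ok)) mean(date[ok]) else NA
}

# Hypothetical variant using the replacement if statement from the note
avg10_first <- function(date) {
  ok <- seq_along(date) <= length(date) / 10
  if (any(ok)) mean(date[ok]) else date[1]
}

small <- c(40, 12, 99)   # made-up group with only 3 observations
avg10(small)             # NA
avg10_first(small)       # 40, the first row of the group
```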

Answer 2

As with your last question, dplyr may be a good fit for you:

Some data:

library(dplyr)
set.seed(42)
n <- 500
dat <- data.frame(date = sample(365, size=n, replace=TRUE),
                  species = sample(5, size=n, replace=TRUE),
                  year = 1980 + sample(20, size=n, replace=TRUE))

What it looks like without filtering:

dat %>% group_by(year, species) %>% arrange(year, date)
## Source: local data frame [500 x 3]
## Groups: year, species
##    date species year
## 1    50       1 1981
## 2   138       1 1981
## 3   174       1 1981
## 4   179       1 1981
## 5   200       1 1981
## 6   332       1 1981
## 7    31       2 1981
## 8    52       2 1981
## 9   196       2 1981
## 10  226       2 1981
## ..  ...     ...  ...

The first 10% by date within each year and species:

dat %>%
    group_by(year, species) %>%
    filter(ntile(date, 10) == 1) %>%
    arrange(year, date)
## Source: local data frame [100 x 3]
## Groups: year, species
##    date species year
## 1    50       1 1981
## 2    31       2 1981
## 3    63       3 1981
## 4   112       4 1981
## 5     1       5 1981
## 6    40       1 1982
## 7   103       2 1982
## 8    40       3 1982
## 9    86       4 1982
## 10   48       5 1982
## ..  ...     ...  ...

I think the ntile trick is doing what you want: it splits the data into bins of roughly equal size, so bin 1 should give you the lowest 10% of dates.
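As a quick illustration of how ntile assigns bins, here is a minimal sketch on a single made-up vector of 20 day-of-year values (the values are illustrative and not taken from dat):

```r
library(dplyr)

# 20 made-up day-of-year values, no ties
x <- c(50, 3, 200, 120, 10, 90, 300, 45, 60, 75,
       130, 140, 150, 160, 170, 180, 190, 210, 220, 230)

bins <- ntile(x, 10)   # bin 1 holds the smallest values, bin 10 the largest
table(bins)            # 20 values over 10 bins: 2 per bin
x[bins == 1]           # the earliest 10%: 3 and 10
```

When a group has fewer than 10 values, bin 1 holds only the single smallest one. That is what happens for species 1 in 1981 above, where six rows reduce to the one with date 50.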

Edit

Sorry, I missed the mean in there:

dat %>% group_by(year, species) %>%
    filter(ntile(date, 10) == 1) %>%
    summarise(date = mean(date)) %>%
    arrange(year, date)
## Source: local data frame [99 x 3]
## Groups: year
##    year species date
## 1  1981       5    1
## 2  1981       2   31
## 3  1981       1   50
## 4  1981       3   63
## 5  1981       4  112
## 6  1982       1   40
## 7  1982       3   40
## 8  1982       5   48
## 9  1982       4   86
## 10 1982       2  103
## ..  ...     ...  ...
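The question's desired output keeps the individual rows and adds the group mean as a column called new. A minimal sketch of that, assuming the same simulated dat as above and using mutate instead of summarise (the object name early is hypothetical):

```r
library(dplyr)

# Same simulated data as in the answer
set.seed(42)
n <- 500
dat <- data.frame(date = sample(365, size = n, replace = TRUE),
                  species = sample(5, size = n, replace = TRUE),
                  year = 1980 + sample(20, size = n, replace = TRUE))

early <- dat %>%
    group_by(year, species) %>%
    filter(ntile(date, 10) == 1) %>%   # keep the earliest tenth of each group
    mutate(new = mean(date)) %>%       # group mean, repeated on every kept row
    ungroup() %>%
    arrange(year, species, date)

head(early)
```

Because mutate keeps one row per retained observation, any other columns in the real data (id, habitat and so on) would be carried along unchanged.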

R: isolating the first 10%
