From which row does a data.frame variable have a constant value

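Date: 2016-10-19 18:21:58

Tags: r time-series dplyr

I want to compute the mean of a variable in an R data.frame over the rows from which another variable starts having a constant value. I usually use dplyr for this kind of database-like task, but I can't figure out how to do this. Here is an example: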

s<-"no Spc PSize
2                0           6493
2                0           9281
2               12          26183
2               12          36180
2               12          37806
2               12          37765
3               12          36015
3               12          26661
3                0          14031
3                0           5564
3                1          17701
3                1          20808
3                1          31511
3                1          44746
3                1          50534
3                1          54858
3                1          58160
3                1          60326"

d <- read.delim(textConnection(s), sep = "", header = TRUE)

mean(d[1:10,3])
sd(d[1:10,3])

From row 11 onward the variable Spc has a constant value, so that is where I want to split the data.frame:

mean(d[11:18,3])
sd(d[11:18,3])

I could work this out by hand, but that is not the point...

3 Answers:

Answer 0 (score: 2)

Option 1: using rleid from the data.table package

library(dplyr)
library(data.table)  # provides rleid()

d %>% 
  group_by(rlid = rleid(Spc)) %>% 
  summarise(mean_size = mean(PSize), sd_size = sd(PSize)) %>% 
  slice(n())

which gives:

# A tibble: 1 × 3
   rlid mean_size  sd_size
  <int>     <dbl>    <dbl>
1     4   42330.5 16866.59
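
To see what the grouping in Option 1 is doing, here is a minimal base-R sketch that builds the same kind of run id from rle. The toy vector x and the name run_id are made up for illustration, and treating this as equivalent to rleid for a single vector is an assumption rather than something stated in the answer.

```r
# Toy vector (hypothetical): four runs of equal consecutive values
x <- c(0, 0, 12, 12, 0, 1, 1)

# rle() returns the run lengths and the value of each run
r <- rle(x)

# Repeat each run's index by its length to get one id per element
run_id <- rep(seq_along(r$lengths), r$lengths)
run_id
# 1 1 2 2 3 4 4
```

Grouping by such an id keeps the two separate runs of 0 apart, which is why the last run can be picked out with slice(n()).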

Option 2: using rle

startrow <- sum(head(rle(d$Spc)$lengths, -1)) + 1
d %>%
  slice(startrow:n()) %>% 
  summarise(mean_size = mean(PSize), sd_size = sd(PSize))

which gives:

  mean_size  sd_size
1   42330.5 16866.59

Option 3: if you want to compute this for two groups (the last group and all the others), use group_by instead of filter and build a new grouping vector with rle (rep_vec):

rep_vec <- c(sum(head(rle(d$Spc)$lengths, -1)), tail(rle(d$Spc)$lengths, 1))

d %>%
  group_by(grp = rep(c('others','last_group'), rep_vec)) %>% 
  summarise(mean_size = mean(PSize), sd_size = sd(PSize))

which gives:

         grp mean_size  sd_size
       (chr)     (dbl)    (dbl)
1 last_group   42330.5 16866.59
2     others   23597.9 13521.32

If you want to include the rows, you can change the code to:

d %>%
  mutate(rn = row_number()) %>% 
  group_by(grp = rep(c('others','last_group'), rep_vec)) %>% 
  summarise(mean_size = mean(PSize), sd_size = sd(PSize), rows = paste0(range(rn), collapse=':'))

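Answer 1 (score: 1)

You can do this by adding a column that checks whether each entry matches the value above it, then using cumsum to find where the count changes. I group_by that and compute the summaries you want. I also added an output column showing which rows each group covers, to illustrate where the values come from.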

d %>%
  mutate(
    row = 1:n()
    , isDiff = Spc != lag(Spc, default = Spc[1])
    , whichGroup = cumsum(isDiff)) %>%
  group_by(whichGroup, Spc) %>%
  summarise(mean = mean(PSize)
            , sd = sd(PSize)
            , whichRows = paste(range(row), collapse = ":"))

which gives:

  whichGroup   Spc    mean        sd whichRows
       <int> <int>   <dbl>     <dbl>     <chr>
1          0     0  7887.0  1971.414       1:2
2          1    12 33435.0  5486.794       3:8
3          2     0  9797.5  5987.073      9:10
4          3     1 42330.5 16866.591     11:18

If you only want the last group (I can't tell from your post whether you do), you can use filter, like this:

d %>%
  mutate(
    row = 1:n()
    , isDiff = Spc != lag(Spc, default = Spc[1])
    , whichGroup = cumsum(isDiff)) %>%
  filter(whichGroup == max(whichGroup)) %>%
  summarise(Spc = Spc[1]
            , mean = mean(PSize)
            , sd = sd(PSize)
            , whichRows = paste(range(row), collapse = ":"))

which gives:

  Spc    mean       sd whichRows
1   1 42330.5 16866.59     11:18

Based on the comments, it seems you want the last group versus all the others, which you can get with this approach:

d %>%
  mutate(
    row = 1:n()
    , isDiff = Spc != lag(Spc, default = Spc[1])
    , whichGroup = cumsum(isDiff)) %>%
  group_by(isLast = whichGroup == max(whichGroup)) %>%
  summarise(mean = mean(PSize)
            , sd = sd(PSize)
            , whichRows = paste(range(row), collapse = ":"))

which gives:

  isLast    mean       sd whichRows
   <lgl>   <dbl>    <dbl>     <chr>
1  FALSE 23597.9 13521.32      1:10
2   TRUE 42330.5 16866.59     11:18

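Answer 2 (score: 0)

So you want to find the index at which the middle vector starts being constant? You can take diff() of the vector, reverse it, and look for the first value that differs from zero. For example,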

vec <- c(1,2,3,4,5,5,5,6,6,6)
diff(vec)
differences <- rev(diff(vec))

# distance from the end of first non-zero
min.dist <- min(which(differences != 0))

# take difference
length(vec) - min.dist + 1

The last value gives you the index at which the vector starts being constant.
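
As a sketch of how the diff() idea could be applied to the question's data, the example below rebuilds the Spc and PSize columns from the post and wraps the index lookup in a helper. The helper name last_run_start is hypothetical; it uses max(which(diff(x) != 0)) + 1, which should give the same index as the reversed-diff arithmetic above, and it adds a guard (an assumption about the desired behaviour) for a vector that is constant throughout.

```r
# Spc and PSize columns copied from the question's example data
d <- data.frame(
  Spc   = c(0, 0, 12, 12, 12, 12, 12, 12, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1),
  PSize = c(6493, 9281, 26183, 36180, 37806, 37765, 36015, 26661, 14031, 5564,
            17701, 20808, 31511, 44746, 50534, 54858, 58160, 60326)
)

# Hypothetical helper: index at which the trailing constant run begins
last_run_start <- function(x) {
  changes <- which(diff(x) != 0)        # i such that x[i + 1] differs from x[i]
  if (length(changes) == 0) return(1L)  # whole vector is constant
  max(changes) + 1L                     # first element after the last change
}

start <- last_run_start(d$Spc)
start                          # 11
mean(d$PSize[start:nrow(d)])   # 42330.5
sd(d$PSize[start:nrow(d)])
```

The start index and mean match the rows 11:18 and the value 42330.5 reported in the other answers.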