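I want to compute the mean of a variable in an R data.frame over the rows starting from the point where another variable becomes constant. I usually use dplyr for this kind of data task, but I can't figure out how to do it here. Here is an example: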
s<-"no Spc PSize
2 0 6493
2 0 9281
2 12 26183
2 12 36180
2 12 37806
2 12 37765
3 12 36015
3 12 26661
3 0 14031
3 0 5564
3 1 17701
3 1 20808
3 1 31511
3 1 44746
3 1 50534
3 1 54858
3 1 58160
3 1 60326"
d <- read.delim(textConnection(s), sep = "", header = TRUE)
mean(d[1:10,3])
sd(d[1:10,3])
From row 11 onward the variable Spc has a constant value, so that is where I want to split the data.frame:
mean(d[11:18,3])
sd(d[11:18,3])
I could work this out by hand, but that is not the idea...
Answer 0 (score: 2)
Option 1: using rleid from the data.table package:
library(dplyr)
library(data.table)  # provides rleid()

d %>%
  group_by(rlid = rleid(Spc)) %>%
  summarise(mean_size = mean(PSize), sd_size = sd(PSize)) %>%
  slice(n())
which gives:
# A tibble: 1 × 3
rlid mean_size sd_size
<int> <dbl> <dbl>
1 4 42330.5 16866.59
Option 2: using rle:
startrow <- sum(head(rle(d$Spc)$lengths, -1)) + 1
d %>%
slice(startrow:n()) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize))
which gives:
mean_size sd_size
1 42330.5 16866.59
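To see why Option 2 lands on row 11, it may help to look at what rle returns for Spc. The following is a minimal base R sketch, not part of the original answer; it copies only the Spc and PSize columns from the question, and the names spc, psize, runs and last_run are introduced here purely for illustration.

```r
# Spc and PSize columns copied from the question's example data
spc <- c(0, 0, 12, 12, 12, 12, 12, 12, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)
psize <- c(6493, 9281, 26183, 36180, 37806, 37765, 36015, 26661,
           14031, 5564, 17701, 20808, 31511, 44746, 50534, 54858,
           58160, 60326)

# rle() collapses consecutive equal values into runs
runs <- rle(spc)
runs$values   # 0 12 0 1
runs$lengths  # 2 6 2 8

# The last run starts right after all earlier runs end
startrow <- sum(head(runs$lengths, -1)) + 1
last_run <- psize[startrow:length(psize)]

startrow        # 11
mean(last_run)  # 42330.5
sd(last_run)    # about 16866.59, as in the output above
```

Dropping the last run length with head(..., -1) and summing the rest counts every row that comes before the final constant stretch, which is why adding 1 gives its first row.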
Option 3: If you want to compute the statistics for both groups (the last group and all the others), use group_by instead of filtering rows, and build a new grouping vector (rep_vec) with rle:
rep_vec <- c(sum(head(rle(d$Spc)$lengths, -1)), tail(rle(d$Spc)$lengths, 1))
d %>%
group_by(grp = rep(c('others','last_group'), rep_vec)) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize))
which gives:
grp mean_size sd_size
(chr) (dbl) (dbl)
1 last_group 42330.5 16866.59
2 others 23597.9 13521.32
If you want to include the row ranges, you can change the code to:
d %>%
mutate(rn = row_number()) %>%
group_by(grp = rep(c('others','last_group'), rep_vec)) %>%
summarise(mean_size = mean(PSize), sd_size = sd(PSize), rows = paste0(range(rn), collapse=':'))
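The grouping vector from Option 3 can also be inspected on its own. This is a small base R sketch under the same assumptions as the question's data (only the Spc column is copied; lens, grp and rows are illustrative names, not from the answer):

```r
# Spc column copied from the question's example data
spc <- c(0, 0, 12, 12, 12, 12, 12, 12, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)

# Run lengths of Spc, then: rows before the last run, rows in the last run
lens <- rle(spc)$lengths
rep_vec <- c(sum(head(lens, -1)), tail(lens, 1))
rep_vec  # 10 8

# rep() repeats each label as many times as its group has rows
grp <- rep(c("others", "last_group"), rep_vec)

# Row range covered by each label
rows <- tapply(seq_along(spc), grp,
               function(i) paste(range(i), collapse = ":"))
rows
```

Because rep_vec always has exactly two entries, this labelling only separates the final run from everything before it, regardless of how many runs precede it.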
Answer 1 (score: 1)
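You can do this by adding a column that checks whether each entry differs from the value above it, then using cumsum on that column to number the runs. I group_by that run number and compute the summaries you want. I also added an output column listing which rows each group covers, to show where the values come from.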
library(dplyr)

d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
group_by(whichGroup, Spc) %>%
summarise(mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
which gives:
whichGroup Spc mean sd whichRows
<int> <int> <dbl> <dbl> <chr>
1 0 0 7887.0 1971.414 1:2
2 1 12 33435.0 5486.794 3:8
3 2 0 9797.5 5987.073 9:10
4 3 1 42330.5 16866.591 11:18
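The lag and cumsum step can be reproduced in base R to see how the run numbers arise. A minimal sketch, assuming only the Spc column from the question (is_diff and which_group mirror the answer's isDiff and whichGroup but are written here without dplyr):

```r
# Spc column copied from the question's example data
spc <- c(0, 0, 12, 12, 12, 12, 12, 12, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)

# TRUE wherever a value differs from the one before it;
# the first element has no predecessor, so it is FALSE
is_diff <- c(FALSE, spc[-1] != spc[-length(spc)])

# Each TRUE bumps the counter, so every run gets its own id
which_group <- cumsum(is_diff)
which_group
# 0 0 1 1 1 1 1 1 2 2 3 3 3 3 3 3 3 3

# Rows covered by each run
split(seq_along(spc), which_group)
```

Grouping by the run id rather than by Spc itself is what keeps the two separate stretches of Spc == 0 (rows 1:2 and 9:10) apart.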
If you only want the last group (I couldn't tell from your post whether you do), you can use filter, like this:
d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
filter(whichGroup == max(whichGroup)) %>%
summarise(Spc = Spc[1]
, mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
which gives:
Spc mean sd whichRows
1 1 42330.5 16866.59 11:18
Based on the comments, it seems you want the last group versus all the others, which you can get with this approach:
d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
group_by(isLast = whichGroup == max(whichGroup)) %>%
summarise(mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
which gives:
isLast mean sd whichRows
<lgl> <dbl> <dbl> <chr>
1 FALSE 23597.9 13521.32 1:10
2 TRUE 42330.5 16866.59 11:18
Answer 2 (score: 0)
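So you want to find the index at which a vector starts being constant through to its end? You can take diff() of the vector and look, from the end, for the first value that differs from zero. For example,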
vec <- c(1,2,3,4,5,5,5,6,6,6)
diff(vec)
differences <- rev(diff(vec))
# position, counting from the end, of the first non-zero difference
min.dist <- min(which(differences != 0))
# convert that position back into an index into vec
length(vec) - min.dist + 1
The last value gives you the index at which the vector starts being constant.
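The same idea can be wrapped in a small function. This is only a sketch: constant_start is a hypothetical helper name, not something from the answer, and it additionally handles a vector that never changes, where min(which(...)) in the snippet above would have nothing to take the minimum of.

```r
# Hypothetical helper: index where the final constant run of x begins
constant_start <- function(x) {
  # Positions i where x[i + 1] differs from x[i]
  changes <- which(diff(x) != 0)
  # No change anywhere: the whole vector is a single run
  if (length(changes) == 0) return(1L)
  # The last run begins just after the last change
  max(changes) + 1L
}

constant_start(c(1, 2, 3, 4, 5, 5, 5, 6, 6, 6))  # 8

# Spc column copied from the question's example data
spc <- c(0, 0, 12, 12, 12, 12, 12, 12, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1)
constant_start(spc)  # 11

constant_start(c(7, 7, 7))  # 1
```

Applied to the question's data this could be used as d[constant_start(d$Spc):nrow(d), ] to pick out the rows of the final run.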