I have a table with multiple soil measurements per day. Soil moisture ranges from 0 to 0.8, and there are also some NAs:
set.seed(24)
df1 <- data.frame(date = sample(seq(as.Date("2015-01-01"),
length.out = 365, by = "1 day"), 5e1, replace = TRUE),
sm = sample(c(NA, runif(10, min=0, max=0.8)), 5e1, replace = TRUE))
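I am trying to calculate the following statistics by month: (1) the percentage of NA values, and (2) the percentage of values in each class (0 to 0.2, 0.2 to 0.4, 0.4 to 0.6 and 0.6 to 0.8). In the example df1 provided, January has five measurements. One of the five is NA, so NA should account for 20% of the total. There is also a value of 0.13, which falls in the 0-0.2 class, so that class gets 20%. There are two 0.23 values, which fall in the 0.2-0.4 class, so that class gets 40%. The last value, 0.68, goes to the 0.6-0.8 class, which is 20% of the January total.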
This is the expected result:
month NA 0-0.2 0.2-0.4 0.4-0.6 0.6-0.8
1 20% 20% 40% 0% 20%
2 0% 0% 50% 25% 25%
3 0% 0% 16.6% 16.6% 66.8%
...
I tried to compute statistic 1 with the following, which failed:
DT[, .(percentage = 100 * sum(is.na(.SD))/length(.SD)), by=month(DT$date)]
But it produces some meaningless percentage values.
Any ideas on how to get there? Thanks!
Answer 0 (score: 0)
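We can try with tidyverse. Convert 'date' to Date class (if it is not already), extract the month from 'date', create a grouping variable with cut() based on the 'sm' column, get the number of elements in each group (n()) by 'month' and 'grp', divide by the total number of rows for each month, and spread() the result to 'wide' format.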
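library(tidyverse)
library(lubridate)  # provides month(); not attached by older tidyverse versions
df1 %>%
  group_by(month = month(date)) %>%
  mutate(n = n()) %>%  # total rows per month, NA rows included
  # bin sm into classes; .add = TRUE keeps the month grouping
  # (dplyr >= 1.0.0; older versions spell it add = TRUE)
  group_by(grp = cut(sm, breaks = seq(0, 0.8, by = 0.2)), .add = TRUE) %>%
  summarise(perc = 100 * n()/first(n)) %>%  # share of each class within its month
  spread(grp, perc, fill = 0)  # one column per class; absent classes become 0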
# A tibble: 12 x 6
# Groups: month [12]
# month `(0,0.2]` `(0.2,0.4]` `(0.4,0.6]` `(0.6,0.8]` `<NA>`
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1.00 20.0 40.0 0 20.0 20.0
# 2 2.00 0 50.0 25.0 25.0 0
# 3 3.00 0 16.7 16.7 66.7 0
# 4 4.00 14.3 42.9 42.9 0 0
# 5 5.00 33.3 16.7 0 50.0 0
# 6 6.00 0 100 0 0 0
# 7 7.00 0 66.7 0 0 33.3
# 8 8.00 20.0 60.0 20.0 0 0
# 9 9.00 14.3 28.6 28.6 14.3 14.3
#10 10.0 50.0 50.0 0 0 0
#11 11.0 0 100 0 0 0
#12 12.0 0 33.3 66.7 0 0
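To see the binning arithmetic in isolation, here is a minimal base R sketch (an alternative to the pipeline above, not the answer's own method). The data frame jan and its dates are hypothetical; only the five sm values come from the January example in the question. It assumes that an exact 0 should count toward the first class, hence include.lowest = TRUE.

```r
# Hypothetical toy data: the five January readings described in the question.
jan <- data.frame(
  date = as.Date(c("2015-01-03", "2015-01-10", "2015-01-15",
                   "2015-01-21", "2015-01-28")),
  sm   = c(NA, 0.13, 0.23, 0.23, 0.68)
)

# Bin the readings; include.lowest = TRUE keeps an exact 0 in the first class.
breaks <- seq(0, 0.8, by = 0.2)
labels <- c("0-0.2", "0.2-0.4", "0.4-0.6", "0.6-0.8")
grp <- cut(jan$sm, breaks = breaks, labels = labels, include.lowest = TRUE)

# addNA() turns NA into a countable factor level (placed last).
counts <- table(month = as.integer(format(jan$date, "%m")),
                class = addNA(grp))

# Row-wise proportions, expressed as percentages.
perc <- round(100 * prop.table(counts, margin = 1), 1)
perc  # January row: 20 40 0 20 20 (four classes in order, NA last)
```

The row matches the January figures worked out in the question: 20% in 0-0.2, 40% in 0.2-0.4, none in 0.4-0.6, 20% in 0.6-0.8 and 20% NA.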
Or with data.table:
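library(data.table)
library(lubridate)  # provides ymd()
# add the per-month row count n by reference, then compute each class's share of it
tmp <- setDT(df1)[, n := .N, month(ymd(date))][, .(perc = 100 * .N/n[1]),
    by = .(month = month(ymd(date)),
           grp = cut(sm, breaks = seq(0, 0.8, by = 0.2),
                     labels = c('0 - 0.2', '0.2 - 0.4', '0.4 - 0.6', '0.6 - 0.8')))]
# reshape to wide format: one row per month, one column per class
dcast(tmp, month ~ grp, value.var = 'perc')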
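As for why the attempt in the question returns odd numbers: inside a data.table call, .SD is itself a data.table holding the group's columns, so length(.SD) counts columns rather than rows, and is.na(.SD) tests every column. The same pitfall can be reproduced with a plain data frame. A minimal base R sketch, where toy is hypothetical data (the five January readings from the question plus two invented February readings):

```r
# Hypothetical toy data: five January readings from the question,
# plus two invented February readings.
toy <- data.frame(
  date = as.Date(c("2015-01-03", "2015-01-10", "2015-01-15", "2015-01-21",
                   "2015-01-28", "2015-02-05", "2015-02-19")),
  sm   = c(NA, 0.13, 0.23, 0.23, 0.68, 0.31, 0.55)
)

# length() of a data frame is its number of columns, not rows,
# so dividing by it does not give a share of the measurements.
length(toy)  # 2
nrow(toy)    # 7

# mean(is.na(x)) is the fraction of missing values in a single column;
# tapply() evaluates it separately for each month.
mon <- as.integer(format(toy$date, "%m"))
na_perc <- 100 * tapply(is.na(toy$sm), mon, mean)
na_perc
```

With data.table, the equivalent would presumably be DT[, .(percentage = 100 * mean(is.na(sm))), by = .(month = month(date))], assuming DT is the data.table version of df1.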
set.seed(24)
df1 <- data.frame(date = sample(seq(as.Date("2015-01-01"),
length.out = 365, by = "1 day"), 3e4, replace = TRUE),
sm = sample(c(NA, rnorm(10)), 3e4, replace = TRUE))