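Thanks in advance for any help; I'm trying to relearn some of the basics.
Below is some example code for my problem. The data come from a database of injured workers.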
Area <- c("Connecticut", "Maine", "Massachusetts", "New Hampshire", "Texas", "Arizona", "California", "Washington")
Region <- c("Northeast", "Northeast", "Northeast", "Northeast", "South", "South", "West", "West")
X2004 <- c(0,1,4,1,3,4,2,2)
X2005 <- c(1,0,6,2,0,1,0,2)
X2006 <- c(0,0,1,1,2,1,0,0)
df1 <- data.frame(Area, Region, X2004, X2005, X2006)
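I want to show the percent change from the 2004-2005 two-year average to 2006 using base R. I was able to solve this with the tidyverse packages, but it feels like using a crutch. Here is what I have so far: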
df2 <- reshape(df1,
idvar=c("Area"),
v.names="count",
varying=c("X2004","X2005","X2006"),
direction="long",
times=2004:2006,
timevar="year")
library(dplyr)  # provides %>%, group_by() and summarise()
df3 <- df2 %>% group_by(Region, year) %>%
summarise(total_count = sum(count))
df3$pre <- ifelse(df3$year<=2005, 1, 0)
df3 %>%
group_by(Region) %>%
summarise(mean_count_pre = mean(total_count[pre==1]),
mean_count_post = mean(total_count[pre==0]),
pct_change = 100*(mean_count_post - mean_count_pre) / mean_count_pre)
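Any ideas on how to solve this without relying on tidyverse or dplyr? Thanks very much for your help. I learned R through the tidyverse, and I'm trying to get a better understanding of the fundamentals.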
Answer 0 (score: 2)
Using your df2 as input, we can do this with base R functions only:
> # creating `total_count`
> df3<- df2
> df3$total_count <- with(df2, ave(count, Region, year, FUN="sum"))
>
> # creating `pre`
> df3$pre <- ifelse(df3$year<=2005, "pre", "post")
>
> # creating "mean_count_pre" and "mean_count_post"
> output <- aggregate(total_count ~ Region+pre, data=df3, FUN="mean")
> colnames(output)[3] <- "mean_count"
> output_wide <- reshape(output, v.names="mean_count", idvar="Region", timevar = "pre", direction = "wide")
>
> # creating `pct_change` (a proportion here; multiply by 100 to get a percentage)
> output_wide <- transform(output_wide, pct_change=(mean_count.post-mean_count.pre)/mean_count.pre)
> output_wide
     Region mean_count.post mean_count.pre pct_change
1 Northeast               2            7.5 -0.7333333
2     South               3            4.0 -0.2500000
3      West               0            3.0 -1.0000000
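As a further sketch (not part of the answer above), the same figures can be computed from the wide df1 without reshaping at all, assuming the year columns keep the names X2004, X2005 and X2006. The names region_totals and mean_count_pre are only illustrative.

```r
# Example data from the question
df1 <- data.frame(
  Area   = c("Connecticut", "Maine", "Massachusetts", "New Hampshire",
             "Texas", "Arizona", "California", "Washington"),
  Region = c("Northeast", "Northeast", "Northeast", "Northeast",
             "South", "South", "West", "West"),
  X2004  = c(0, 1, 4, 1, 3, 4, 2, 2),
  X2005  = c(1, 0, 6, 2, 0, 1, 0, 2),
  X2006  = c(0, 0, 1, 1, 2, 1, 0, 0)
)

# Sum every year column within each Region (one row per Region)
region_totals <- aggregate(cbind(X2004, X2005, X2006) ~ Region,
                           data = df1, FUN = sum)

# Two-year baseline: mean of the 2004 and 2005 regional totals
region_totals$mean_count_pre <- rowMeans(region_totals[, c("X2004", "X2005")])

# Percent change from the baseline to 2006
region_totals$pct_change <- with(region_totals,
                                 100 * (X2006 - mean_count_pre) / mean_count_pre)

region_totals[, c("Region", "mean_count_pre", "X2006", "pct_change")]
```

This works here because each year is already its own column, so the pre-period mean is just a row mean over two columns. With many years, or years stored as values, the long format used above is likely the easier route.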
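Answer 1 (score: 1)
Consider aggregate as the replacement for group_by and summarise, and use a double aggregation merged on Region for the pre and post calculations. within and transform are both used for in-place column assignment, and setNames renames the columns, which cannot be done inside the aggregation itself.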
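To make the role of each helper concrete, here is a minimal sketch on hypothetical toy data (the data frame toy and its columns g, period and n are made up for illustration and are not from the question). It relies on the fact that the formula interface of aggregate drops rows with NA in the variables named in the formula by default.

```r
# Hypothetical toy data, not from the question
toy <- data.frame(g      = c("a", "a", "b", "b"),
                  period = c("pre", "post", "pre", "post"),
                  n      = c(2, 4, 10, 5))

# within(): add helper columns that are NA outside their own period
toy <- within(toy, {
  n_pre  <- ifelse(period == "pre",  n, NA)
  n_post <- ifelse(period == "post", n, NA)
})

# aggregate(): rows where the left-hand variable is NA are omitted,
# so each call only averages over its own period
pre  <- aggregate(n_pre  ~ g, data = toy, FUN = mean)
post <- aggregate(n_post ~ g, data = toy, FUN = mean)

# merge() joins the two summaries on g; setNames() sets the final column names
both <- setNames(merge(pre, post, by = "g"),
                 c("g", "mean_pre", "mean_post"))

# transform(): derive a new column from the existing ones
both <- transform(both, pct_change = 100 * (mean_post - mean_pre) / mean_pre)
both
```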
Tidyverse
df3 <- df2 %>% group_by(Region, year) %>%
summarise(total_count = sum(count))
df3$pre <- ifelse(df3$year<=2005, 1, 0)
aggdf <- df3 %>%
group_by(Region) %>%
summarise(mean_count_pre = mean(total_count[pre==1]),
mean_count_post = mean(total_count[pre==0]),
pct_change = 100*(mean_count_post - mean_count_pre) / mean_count_pre)
Base R
df3_base <- setNames(aggregate(count~Region + year, df2, sum),
c("Region", "year", "total_count"))
df3_base <- within(df3_base, {
pre <- ifelse(year <= 2005, 1, 0)
count_pre <- ifelse(pre==1, total_count, NA)
count_post <- ifelse(pre==0, total_count, NA)
})
aggdf_base <- transform(setNames(merge(aggregate(count_pre ~ Region, df3_base, FUN = mean),
aggregate(count_post ~ Region, df3_base, FUN = mean),
by="Region"),
c("Region", "mean_count_pre", "mean_count_post")),
pct_change = 100*(mean_count_post - mean_count_pre) / mean_count_pre)
Comparison
identical(data.frame(aggdf), aggdf_base)
# [1] TRUE
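A side note on the check itself: identical is strict about storage types and attributes, so two data frames that print the same can still compare as different. The sketch below uses two made-up data frames (a and b, not from the question) to show the difference from all.equal, which compares numeric values with a tolerance.

```r
a <- data.frame(x = 1:3)         # integer column
b <- data.frame(x = c(1, 2, 3))  # double column holding the same values

strict   <- identical(a, b)        # storage types must match exactly
tolerant <- isTRUE(all.equal(a, b))  # numeric values compared with a tolerance

strict
tolerant
```

If a comparison like the one above ever comes back FALSE, wrapping all.equal in isTRUE is a reasonable next step before assuming the two results really differ.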