我有一组空间坐标(x,y)数据,在几年的过程中每个坐标都有一个响应变量。以下代码生成类似的数据框:
df <- data.frame(
id = rep(1:2, 2),
x = rep(c(25, 30),10),
y = rep(c(100, 200), 10),
year = rep(1980:1989, 2),
response = rnorm(20)
)
结果数据框:
head(df)
id x y year response
1 1 25 100 1980 0.1707431
2 2 30 200 1981 1.3562263
3 1 25 100 1982 -0.4590506
4 2 30 200 1983 1.3238410
5 1 25 100 1984 1.7765772
6 2 30 200 1985 -0.6258069
我希望通过时间对每个单元格运行线性回归以获得响应变量的变化。为此,我使用plyr和ddply命令:
require(plyr)
lm.df <- ddply(df, .(id), function(z)coef(lm(response ~ year, data = z)))
然后我重新组合了空间数据(这里的重组示例有点简单,但适用于该示例。):
points <- data.frame(
id = c(1,2),
x = c(25,30),
y = c(100,200)
)
lm.stack <- merge(points, lm.df, by="id")
colnames(lm.stack) <- c("ID", "x", "y", "intercept", "slope")
print(lm.stack)
ID x y intercept slope
1 1 25 100 257.7291 -0.12985632
2 2 30 200 173.3676 -0.08708068
这很有效。但我想要的是能够提取每个lm模型的相等平方值,以便能够通过坐标单元来表示显着趋势,其最终目标是投影由显着性值着色的细胞图。这是我需要帮助的地方,我非常感激。谢谢!
答案 0 :(得分:6)
我建议使用包dplyr
和broom
的组合。
这种方法将作为数据帧返回您需要了解的关于模型的所有信息(关于系数的信息和关于模型本身的信息):
set.seed(25)
df <- data.frame(
id = rep(1:2, 2),
x = rep(c(25, 30),10),
y = rep(c(100, 200), 10),
year = rep(1980:1989, 2),
response = rnorm(20)
)
df
# id x y year response
# 1 1 25 100 1980 -0.21183360
# 2 2 30 200 1981 -1.04159113
# 3 1 25 100 1982 -1.15330756
# 4 2 30 200 1983 0.32153150
# 5 1 25 100 1984 -1.50012988
# 6 2 30 200 1985 -0.44553326
# 7 1 25 100 1986 1.73404543
# 8 2 30 200 1987 0.51129562
# 9 1 25 100 1988 0.09964504
# 10 2 30 200 1989 -0.05789111
# 11 1 25 100 1980 -1.74278763
# 12 2 30 200 1981 -1.32495298
# 13 1 25 100 1982 -0.54793388
# 14 2 30 200 1983 -1.45638428
# 15 1 25 100 1984 0.08268682
# 16 2 30 200 1985 0.92757895
# 17 1 25 100 1986 -0.71676933
# 18 2 30 200 1987 0.96239968
# 19 1 25 100 1988 1.54588458
# 20 2 30 200 1989 -1.00976361
library(dplyr)
library(broom)
df %>%
group_by(id) %>%
do({model = lm(response~year, data=.) # create your model
data.frame(tidy(model), # get coefficient info
glance(model))}) # get model info
# id term estimate std.error statistic p.value r.squared adj.r.squared sigma statistic.1 p.value.1 df logLik AIC BIC deviance df.residual
# 1 1 (Intercept) -492.2144842 213.19252113 -2.308779 0.04978386 0.3996362 0.32459069 0.9611139 5.325253 0.04987162 2 -12.67704 31.35409 32.26184 7.389919 8
# 2 1 year 0.2479705 0.10745580 2.307651 0.04987162 0.3996362 0.32459069 0.9611139 5.325253 0.04987162 2 -12.67704 31.35409 32.26184 7.389919 8
# 3 2 (Intercept) -258.6253012 196.88284243 -1.313600 0.22539989 0.1771294 0.07427055 0.8871395 1.722063 0.22582607 2 -11.87614 29.75227 30.66003 6.296132 8
# 4 2 year 0.1301582 0.09918521 1.312274 0.22582607 0.1771294 0.07427055 0.8871395 1.722063 0.22582607 2 -11.87614 29.75227 30.66003 6.296132 8
如果您真的希望在最终数据框中使用x和y列,则可以使用group_by(id,x,y)
。例如,如果您想要一个类似的输出,您可以这样做:
library(dplyr)
library(broom)
library(tidyr)
dd =
df %>%
group_by(id, x, y) %>%
do({model = lm(response~year, data=.) # create your model
data.frame(tidy(model), # get coefficient info
glance(model))}) # get model info
然后选择您需要的信息:
dd %>%
select(id, x, y, term, estimate, adj.r.squared)
# id x y term estimate adj.r.squared
# 1 1 25 100 (Intercept) -492.2144842 0.32459069
# 2 1 25 100 year 0.2479705 0.32459069
# 3 2 30 200 (Intercept) -258.6253012 0.07427055
# 4 2 30 200 year 0.1301582 0.07427055
你可以得到一行截距和一行斜率。
或者,甚至将该数据框重塑为:
dd %>%
select(id, x, y, term, estimate, adj.r.squared) %>%
spread(term, estimate)
# id x y adj.r.squared (Intercept) year
# 1 1 25 100 0.32459069 -492.2145 0.2479705
# 2 2 30 200 0.07427055 -258.6253 0.1301582
列year
是slope
(变量系数year
)
以类似的方式,您还可以使用包purrr
创建一个新的数据框,其中包含具有相应信息的列表列:
set.seed(25)
df <- data.frame(
id = rep(1:2, 2),
x = rep(c(25, 30),10),
y = rep(c(100, 200), 10),
year = rep(1980:1989, 2),
response = rnorm(20)
)
library(dplyr)
library(broom)
library(tidyr)
library(purrr)
dd = df %>%
group_by(id, x, y) %>% # for each combination of those variables
nest() %>% # nest data (rest of columns)
mutate(Model = map(data, ~lm(response~year, data=.)), # use nested data to build model of interest
Coeff_Info = map(Model, tidy), # get coefficient info
Model_Info = map(Model, glance)) %>% # get model info
ungroup() # forget the grouping
# check how new dataset looks like
dd
# # A tibble: 2 x 7
# id x y data Model Coeff_Info Model_Info
# <int> <dbl> <dbl> <list> <list> <list> <list>
# 1 1 25 100 <tibble [10 x 2]> <S3: lm> <data.frame [2 x 5]> <data.frame [1 x 11]>
# 2 2 30 200 <tibble [10 x 2]> <S3: lm> <data.frame [2 x 5]> <data.frame [1 x 11]>
您仍然可以访问所需的任何元素,但请记住,您的某些列现在具有列表元素:
# get coefficient info for both models
dd$Coeff_Info
# [[1]]
# term estimate std.error statistic p.value
# 1 (Intercept) -492.2144842 213.1925211 -2.308779 0.04978386
# 2 year 0.2479705 0.1074558 2.307651 0.04987162
#
# [[2]]
# term estimate std.error statistic p.value
# 1 (Intercept) -258.6253012 196.88284243 -1.313600 0.2253999
# 2 year 0.1301582 0.09918521 1.312274 0.2258261
# get the r squared value for 1st model
dd %>% # from new dataset
filter(id == 1) %>% # keep all info / rows of 1st model
pull(Model_Info) %>% # get model info
map_dbl("r.squared") # show r squared
# [1] 0.3996362
或者
dd %>% # from new dataset
unnest(id, Model_Info) %>% # umnest id and model info
filter(id == 1) %>% # keep row of 1st model
pull(r.squared) # show the r squared
# [1] 0.3996362
答案 1 :(得分:1)
如果将模型保留在列表中,则可以使用lapply
或任何其他列表遍历函数轻松地从中提取任何内容。
lms <- dlply(df, .(id), function(x) lm(response ~ year, data=x))
rsq <- lapply(lms, function(x) summary(x)$adj.r.squared) # get the adjusted r2
自己制作列表的另一种方法是lmList
nlme
library(nlme)
lms2 <- lmList(response ~ year | id, data=df)
summary(lms2)$adj.r.squared