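I am using the lifelines library to estimate a Cox PH model. For the regression I have many categorical features, which I one-hot encoded, dropping one column per feature to avoid multicollinearity (the dummy variable trap). I have not attached code, because the example would be similar to the one given in the documentation here.
Running cph.check_assumptions(data), I get a message for each dummy variable saying that it violates the assumption: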
Variable 'dummy_a' failed the non-proportional test: p-value is 0.0063.
Advice: with so few unique values (only 2), you can try `strata=['dummy_a']` in the call in `.fit`. See documentation in link [A] and [B] below.
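How should I interpret this advice when a single categorical feature is represented by several dummy variables? Should I add all of them to the strata?
Any help would be appreciated :)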
Answer 0: (score: 1)
@abu, your question exposes a glaring gap in the documentation: what to do when dummy variables fail the proportional test. In this case, I would suggest not dummy-encoding the variable at all, and instead adding the original categorical column as a stratification variable, passing it through the `strata` argument of `.fit`.