Question

I am trying to fit a survival model for each quartile of a variable in a dataset. Take, for example, the lung cancer dataset (lung) provided in the survival package.

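However, I get an error about variable lengths differing. I want to fit four models, one for each quartile:

Less than or equal to 25%

Greater than 25% and less than or equal to 50%

Greater than 50% and less than or equal to 75%

Greater than 75%

How do I do this?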

Answer 1

By default, quantile uses probs = seq(0, 1, 0.25) and therefore returns 5 values. I think you want cut to get a factor variable:

library(survival)
datalung <- lung
datalung$fage <- with(datalung, cut(age, quantile(age), include.lowest = TRUE))

## don't use `attach()`; use the `data` argument of model fitting routine
fit <- survfit(Surv(time,status) ~ fage, data = datalung, type="kaplan-meier")

#Call: survfit(formula = Surv(time, status) ~ fage, data = datalung, 
#    type = "kaplan-meier")
#
#              n events median 0.95LCL 0.95UCL
#fage=[39,56] 58     39    337     239     457
#fage=(56,63] 59     41    348     245     574
#fage=(63,69] 55     39    329     285     477
#fage=(69,82] 56     46    283     222     361
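
If four separate model objects are wanted rather than one fit with four strata, one possible sketch is to split the data by the quartile factor and fit an intercept-only survfit to each piece. The names q_breaks and fits are illustrative and not from the original post, and the comment about the length mismatch is an assumption about the question's code, which is not shown.

```r
library(survival)

datalung <- lung

# quantile() returns 5 break points (0%, 25%, 50%, 75%, 100%), not one
# value per row; this is the likely source of a "variable lengths differ"
# error when it is used directly in a model formula.
q_breaks <- quantile(datalung$age)
length(q_breaks)   # number of break points
nrow(datalung)     # number of rows in the data

# cut() maps every row to one of the four quartile intervals.
datalung$fage <- cut(datalung$age, q_breaks, include.lowest = TRUE)

# One Kaplan-Meier fit per quartile, stored in a named list.
fits <- lapply(split(datalung, datalung$fage), function(d) {
  survfit(Surv(time, status) ~ 1, data = d)
})

names(fits)
```

Each element of fits can then be printed or plotted on its own, for example fits[[1]].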

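Follow-up

@42- also uses quantile, but what he gets are "left-closed, right-open" intervals.

Your question asks for:

Less than or equal to 25%
Greater than 25% and less than or equal to 50%
Greater than 50% and less than or equal to 75%
Greater than 75%

Clearly you want "left-open, right-closed" intervals, so my code does exactly what you want.

The question "What is the meaning of include.lowest in reclassify raster package" explains in detail the include.lowest and right arguments of cut and raster::reclassify. Now let's compare: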

## my factor
table(with(datalung, cut(age, quantile(age), include.lowest = TRUE)))
#[39,56] (56,63] (63,69] (69,82] 
#     58      59      55      56 

## 42-'s factor
table(with(datalung, cut(age, quantile(age), include.lowest = TRUE, right = FALSE)))
#[39,56) [56,63) [63,69) [69,82] 
#     49      57      55      67
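
To see what right and include.lowest do to values that sit exactly on a break, a toy vector can help. The values below are made up purely for illustration.

```r
x <- c(1, 2, 3, 4, 5)
breaks <- c(1, 3, 5)

# Default right = TRUE: intervals are (a, b]. include.lowest = TRUE
# closes the first interval on the left so the minimum is not dropped.
cut(x, breaks, include.lowest = TRUE)

# right = FALSE: intervals are [a, b). include.lowest = TRUE now closes
# the last interval on the right so the maximum is not dropped.
cut(x, breaks, include.lowest = TRUE, right = FALSE)

# Without include.lowest, the value on the outer (lowest) break becomes NA.
cut(x, breaks)
```

The value 3 lands in the first group under right = TRUE and in the second group under right = FALSE, which is the same kind of shift seen for the boundary ages above.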

Answer 2

I tried this using my preferred method of creating a quartile indicator:

library(survival)
datalung <- lung
# Agree with Zheyuan Li that attach()-ing is dangerous practice.
if ("datalung" %in% search()) detach(datalung)  # only needed if it was attached
fit3<- survfit(Surv(time,status) ~ findInterval(age, quantile(age)[-5]), 
                     data=datalung, type = "kaplan-meier")

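The fifth item in the vector of split values needs to be removed because findInterval uses intervals that are closed on the left, so keeping it would produce a fifth group containing only the maximum age. Note that our quartile counts differ. His method loses cases, and not just from the smallest or largest group. Where they went ... I am not yet sure: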

> fit3
Call: survfit(formula = Surv(time, status) ~ findInterval(age, quantile(age)[-5]), 
    data = datalung, type = "kaplan-meier")

                                        n events median 0.95LCL 0.95UCL
findInterval(age, quantile(age)[-5])=1 49     32    320     226     533
findInterval(age, quantile(age)[-5])=2 57     41    340     245     433
findInterval(age, quantile(age)[-5])=3 55     39    310     267     524
findInterval(age, quantile(age)[-5])=4 67     53    285     229     363
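
The effect of dropping the fifth break can be checked directly. This is a minimal sketch assuming the lung data from the survival package; q is just a local name, and the rightmost.closed variant is offered as an alternative, not as the approach used in the answer.

```r
library(survival)

q <- quantile(lung$age)

# With all five breaks, findInterval() puts the maximum age into a
# fifth group of its own, because each interval is closed on the left.
table(findInterval(lung$age, q))

# Dropping the last break folds those cases into the fourth group.
table(findInterval(lung$age, q[-5]))

# Alternative: keep all breaks and let rightmost.closed = TRUE
# treat the last interval as closed on both ends.
table(findInterval(lung$age, q, rightmost.closed = TRUE))
```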

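Your question to Zheyuan Li about the order of levels in ggplot exposes another drawback of using cut, at least when names are not supplied through the labels argument. The levels are sorted in lexical order, and "[" is greater than "(":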

> levels(datalung$fage)
[1] "[39,56]" "(56,63]" "(63,69]" "(69,82]"
> "[" < "("
[1] FALSE
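
One way to sidestep the bracket-based level names is to pass explicit labels to cut(). This is only a sketch; the labels Q1 to Q4 are an arbitrary choice for illustration.

```r
library(survival)

datalung <- lung

# Explicit labels replace the "[39,56]"-style names; the level order
# follows the order of the breaks.
datalung$fage <- cut(datalung$age, quantile(datalung$age),
                     include.lowest = TRUE,
                     labels = c("Q1", "Q2", "Q3", "Q4"))

levels(datalung$fage)
table(datalung$fage)
```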

To settle the question of how my use of quantile differs from @ZheyuanLi's, and his mischaracterization of my method, one only needs to examine:

> quantile(datalung$age)
  0%  25%  50%  75% 100% 
  39   56   63   69   82 

> with( datalung, table( findInterval(age, quantile(datalung$age)[-5] )))

 1  2  3  4 
49 57 55 67

So most of the difference lies in how age 56 is handled:

>  sum(lung$age==56)
[1] 9
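
A cross-tabulation of the two groupings shows which cases change group. This is a sketch; g_cut, g_fi and moved are hypothetical names introduced here.

```r
library(survival)

q <- quantile(lung$age)

g_cut <- cut(lung$age, q, include.lowest = TRUE)   # (a, b] intervals
g_fi  <- findInterval(lung$age, q[-5])             # [a, b) intervals

# Off-diagonal cells are the cases assigned to different groups.
table(g_cut, g_fi)

# Ages of the cases that moved: each one coincides with an interior break.
moved <- as.integer(g_cut) != g_fi
sort(unique(lung$age[moved]))
```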

An attempt to deal with the labeling problem when using cut() (which is not my responsibility anyway, is it?):

> library(ggplot2)  # checked to make sure I have the most recent version per CRAN
> autoplot(fit2)
Error: Objects of type survfit not supported by autoplot.
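
If autoplot() has no method for survfit objects in the installed setup, the plot method that ships with the survival package can draw the curves with base graphics. This sketch refits the model under the name fit_q, which is not from the original post.

```r
library(survival)

datalung <- lung
datalung$fage <- cut(datalung$age, quantile(datalung$age),
                     include.lowest = TRUE)

fit_q <- survfit(Surv(time, status) ~ fage, data = datalung)

# plot.survfit() draws one curve per stratum; the legend is built from
# the factor levels, which are in the same order as the strata.
plot(fit_q, col = 1:4, xlab = "Days", ylab = "Survival probability")
legend("topright", legend = levels(datalung$fage), col = 1:4, lty = 1)
```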

How do I fit a survival model for each quartile of a variable?
