Question

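I have been working on a file for calculating hospital infection rates. I want to standardise the infection rates to the annual procedure counts. The data are located here, because the dput output was too large. SSI is the surgical site infection indicator (1 = infected, 0 = not infected), and Procedure is the type of procedure. Year was derived using lubridate.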

library(plyr)


fname <- "https://raw.github.com/johnmarquess/some.data/master/hospG.csv"
download.file(fname, destfile='hospG.csv', method='wget')
hospG <- read.csv('hospG.csv')

Inf_table <- ddply(hospG, "Year", summarise, 
      Infections = sum(SSI == 1),
      Procedures = length(Procedure),
      PropInf = round(Infections/Procedures * 100 ,2)
)

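This gives me the number of infections, the number of procedures, and the proportion infected for each year at this hospital.

What I would like is an additional column for the standardised proportion infected. One way of doing this outside of Inf_table is: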

s1 <- sum(Inf_table$Infections)
s2 <- sum(Inf_table$Procedures)

Expected_prop_inf <- Inf_table$Procedures * s1/s2

Is there a way to get ddply to do this? I tried writing a function to calculate Expected_prop_inf, but I didn't get very far.

Thanks for any help offered.

Answer 1

This is harder with ddply, because you are dividing by a number that comes from outside the grouping. It is better done in base R.

# base
> with(Inf_table, Procedures*(sum(Infections)/sum(Procedures)))
[1] 17.39184 17.09623 23.00847 20.84065 24.83141 24.83141
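As a small illustration of the base R approach, the sketch below attaches the expected count as a new column. The summary table here is hypothetical toy data, not the hospG figures, and the column name Expected_prop_inf is taken from the question (note that the value is an expected count of infections, not a proportion).

```r
# Hypothetical yearly summary; these numbers are made up for illustration
Inf_table <- data.frame(
  Year       = 2001:2003,
  Infections = c(3, 4, 5),
  Procedures = c(100, 200, 300)
)

# Overall infection rate across all years: 12 / 600 = 0.02
# Expected infections per year = that year's procedures * overall rate
Inf_table <- transform(
  Inf_table,
  Expected_prop_inf = Procedures * sum(Infections) / sum(Procedures)
)

print(Inf_table$Expected_prop_inf)  # 2 4 6
```

Because sum() is evaluated over the whole column, no grouping step is involved at this stage.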

For comparison, here is the same thing with ddply, which is less natural:

# NB note .(Year) is unique for every row, you might also use rownames
> s1 <- sum(Inf_table$Infections)
> s2 <- sum(Inf_table$Procedures)
> ddply(Inf_table, .(Year), summarise, Procedures*(s1/s2))
  Year      ..1
1 2001 17.39184
2 2002 17.09623
3 2003 23.00847
4 2004 20.84065
5 2005 24.83141
6 2006 24.83141
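If staying within plyr is preferred, one possible arrangement (a sketch, not something proposed in the original answer) is to let ddply build the per-year summary and then add the marginal calculation with plyr's mutate, which works on the whole summary table rather than per group. The hosp data frame below is hypothetical toy data.

```r
library(plyr)

# Hypothetical per-procedure records, made up for illustration
hosp <- data.frame(
  Year      = c(2001, 2001, 2001, 2001, 2002, 2002),
  SSI       = c(1, 0, 0, 0, 1, 1),
  Procedure = c("A", "B", "A", "B", "A", "A")
)

# Step 1: grouped summary per year
yearly <- ddply(hosp, "Year", summarise,
  Infections = sum(SSI == 1),
  Procedures = length(Procedure)
)

# Step 2: marginal totals are computed over the whole summary table
Inf_table <- plyr::mutate(
  yearly,
  Expected_prop_inf = Procedures * sum(Infections) / sum(Procedures)
)

print(Inf_table$Expected_prop_inf)  # 2 1
```

This avoids relying on s1 and s2 being visible from inside the ddply call.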

Answer 2

Here is a solution that uses data.table for the aggregation. I'm not sure whether it can be done in a single step.

library(data.table)

fname <- "https://raw.github.com/johnmarquess/some_data/master/hospG.csv"
hospG <- read.csv(fname)
# Convert to a data.table; DT is used in the aggregation below
DT <- as.data.table(hospG)

Inf_table <- DT[, {Infections = sum(SSI == 1)
                   Procedures = length(Procedure)
                   PropInf = round(Infections/Procedures * 100 ,2)
                   list(
                     Infections = Infections,
                     Procedures = Procedures,
                     PropInf = PropInf
                   )
                   }, by = Year]


Inf_table[,Expected_prop_inf := list(Procedures * sum(Infections)/sum(Procedures))]

tables()

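An added benefit of this approach is that the second step does not create another data.table; it adds a new column to the existing one by reference. This matters more if your dataset is larger.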
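The two steps can also be written as a single chained expression. This is only a sketch on hypothetical toy data: it still performs the aggregation and the column update as two operations internally, so it is a more compact spelling rather than a true one-pass solution.

```r
library(data.table)

# Hypothetical per-procedure records, made up for illustration
DT <- data.table(
  Year      = c(2001, 2001, 2001, 2001, 2002, 2002),
  SSI       = c(1, 0, 0, 0, 1, 1),
  Procedure = c("A", "B", "A", "B", "A", "A")
)

# First [...]: per-year summary (.N counts rows in each group)
# Second [...]: add the expected count by reference, using totals
#               over the whole summary table
# Trailing []: makes the result print when evaluated at the console
Inf_table <- DT[, .(Infections = sum(SSI == 1), Procedures = .N), by = Year][
  , Expected_prop_inf := Procedures * sum(Infections) / sum(Procedures)][]

print(Inf_table$Expected_prop_inf)  # 2 1
```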

Calculating marginal totals as a function in a ddply call
