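I have a data frame with a patient ID and a date, sorted by date within each ID. Each patient usually has only a few rows, and some have just one. For example: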
patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06
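From this, I would like to generate a data frame or CSV file containing the start date and end date for each patient, so from the data above I would get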
patid start end
1302 2009-01-27 2009-08-28
1670 2009-03-12 2009-03-12
2073 2009-04-03 2011-03-06
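My initial file has more than 30 million rows, so I don't want to write a for loop.
I was wondering whether there is an efficient way to do this, perhaps starting by using aggregate to get the number of rows for each patient?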
Answer 0 (score: 1)
Using sqldf:
Input data:
df=read.table(text="patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06",header=T)
Code:
library(sqldf)
# min()/max() work on the text column because ISO 8601 dates (YYYY-MM-DD) sort chronologically
sqldf("select patid,min(date) as start, max(date) as end from df group by patid")
Answer 1 (score: 1)
Using tidyverse:
library(dplyr) # provides %>%, group_by(), mutate(), summarise()
read.table(text="patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06",header=TRUE)%>%
mutate(date=lubridate::ymd(date))%>% # parse the dates once, before grouping
group_by(patid)%>%
summarise(start=min(date),
end=max(date))
# A tibble: 3 x 3
patid start end
<int> <date> <date>
1 1302 2009-01-27 2009-08-28
2 1670 2009-03-12 2009-03-12
3 2073 2009-04-03 2011-03-06
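Because the question states that the rows are already sorted by date within each patient, a package-free variant is also possible: the first row of each patient holds the start date and the last row holds the end date. The following is only a sketch under that assumption, plus the further assumption that each patient's rows are contiguous. The names pdates, first_rows, last_rows and spans are illustrative and do not come from the original answers.

```r
# Sample data from the question; dates are kept as character strings.
pdates <- read.table(text = "patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06", header = TRUE, stringsAsFactors = FALSE)

# Assumption: rows are sorted by date within patid and each patid is contiguous.
first_rows <- !duplicated(pdates$patid)                   # TRUE on the first row of each patient
last_rows  <- !duplicated(pdates$patid, fromLast = TRUE)  # TRUE on the last row of each patient

# Both logical vectors select one row per patient, in the same patient order.
spans <- data.frame(patid = pdates$patid[first_rows],
                    start = pdates$date[first_rows],
                    end   = pdates$date[last_rows],
                    stringsAsFactors = FALSE)
print(spans)
```

If the sort order cannot be guaranteed, the min()/max() approaches in the answers are the safer choice, since they do not depend on row order.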
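Answer 2 (score: 0)
aggregate() with FUN = a simple custom function that returns a vector of two outputs in one step:
As you suggested, you can use aggregate(), but as shown below, you can do the min() and max() calculation for each patid group in a single step.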
# Read in your sample data, being careful to prevent dates from becoming factors
pdates <-
read.table( text="patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06",
header=TRUE,
stringsAsFactors=FALSE) # keep date strings from becoming factors!
aggregate( x = pdates["date"], # dataframe with column(s) to aggregate
by = pdates["patid"], # passing dataframe with named column "patid" preserves the column name in the output
FUN = function(vdate) {
c(start=min(vdate), end=max(vdate))
}
)
patid date.start date.end
1 1302 2009-01-27 2009-08-28
2 1670 2009-03-12 2009-03-12
3 2073 2009-04-03 2011-03-06
The same result using the built-in range() function:
aggregate( pdates["date"], by=pdates["patid"], range)
patid date.1 date.2
1 1302 2009-01-27 2009-08-28
2 1670 2009-03-12 2009-03-12
3 2073 2009-04-03 2011-03-06
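One detail worth knowing about the aggregate() results above: when FUN returns a vector of length two, the date column of the result is a two-column matrix rather than two ordinary columns, which is why the printed headers read date.1 and date.2. Since the question asks for a data frame or CSV file with patid, start and end, a possible way to flatten and export the result is sketched below. The names res, flat and out_file are illustrative, and the temporary file path is only a placeholder.

```r
# Sample data from the question; dates are kept as character strings.
pdates <- read.table(text = "patid date
1302 2009-01-27
1302 2009-02-05
1302 2009-08-28
1670 2009-03-12
2073 2009-04-03
2073 2010-11-01
2073 2010-12-19
2073 2011-03-06", header = TRUE, stringsAsFactors = FALSE)

res <- aggregate(pdates["date"], by = pdates["patid"], FUN = range)

# res$date is a two-column matrix, so pull its columns out explicitly.
flat <- data.frame(patid = res$patid,
                   start = res$date[, 1],
                   end   = res$date[, 2],
                   stringsAsFactors = FALSE)

# Placeholder output path; replace with the real destination file.
out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, row.names = FALSE)
```

Reading the file back with read.csv(out_file) should give three ordinary columns named patid, start and end.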