I want to aggregate some data by combining two datasets, lsr and adherence (reproducible example):
adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013.1", "2013.1", "2013.1", "2013.2", "2013.2", "2013.2"))
library(dplyr)
library(tidyr)
names(adherence)[1] <- "ID"
names(adherence)[2] <- "year"
adherence$year <- as.numeric(as.character(adherence$year))
lsr <- cbind.data.frame(
c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
c("2012.3", "2012.8", "2013.1","2012.8", "2013.3", "2013.9", "2011", "2013"), #eksd
c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"
lsr$eksd <- as.numeric(as.character(lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD/365.25
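So, in the adherence dataset I want a new column containing information from the lsr dataset, depending on the ID and year variables of each row in the adherence dataset. I wrote this function, which I thought would do that: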
function.DDAV <- function() {
Y <- lsr %>% #dummy variable
filter(., .$ID == adherence$ID) %>% #filters lsr by ID from relevant row in adherence dataset
filter(., .$eksd <= adherence$year & adherence$year <= .$ENDDATE) # filters further to include relevant time period
Y$DIFF <- (Y$ENDDATE - adherence$year)*365.25 # calculates the drug dosage still available for the dosing regimen at time = adherence$year, i.e. compares the time from the adherence dataset with the time until the dose runs out
z <- sum(Y$DIFF) #sum available drugs doses
return(z) # returns sum to be inputted in one row in new column in adherence dataset
}
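I want to apply it once for each row in the adherence dataset, using that row's values, to create a new column.
I have tried the apply family of functions and aggregate. I have been struggling for several days and have hit various errors, which I have not shown here because I suspect my whole approach to the problem may be wrong. When searching, I could only find questions about applying a function over whole rows or columns within a single dataset.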
Edit: the desired output is shown below:
adherence$adherence <- as.numeric(c("90", "0", "53.475", "16.95", "120", "0"))
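In essence, the calculation after filtering is simple: if any rows remain after filtering, the value is computed from them. Take the first row, with adherence$ID = 1 and adherence$year = 2013.1. Only the last lsr row for ID = 1 remains, and the sum is (2013.346 (lsr$ENDDATE, rounded) - 2013.1 (adherence$year)) * 365.25 = 90.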
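The arithmetic for that first row can be checked directly. This is only a minimal sketch in base R; the variable names `year` and `remaining` are introduced here for illustration and are not part of the datasets above.

```r
# Worked check for the first adherence row (ID = 1, year = 2013.1).
eksd <- 2013.1   # start of the last lsr record for ID = 1
DDD  <- 90       # DDD value of that record
year <- 2013.1   # adherence$year for the first row (illustrative name)

ENDDATE <- eksd + DDD / 365.25            # roughly 2013.346
remaining <- (ENDDATE - year) * 365.25    # days of supply left at 'year'
round(remaining, 3)
```

Because `eksd` equals `year` for this row, the whole 90-day supply is still available.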
Answer 0 (score: 1)
I'm not 100% sure this is what you're after, but hopefully it's a starting point.
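My understanding is that, for each occurrence of an ID in adherence, you want to compare it with all rows in lsr that have the same ID, and wherever year is greater than or equal to eksd and no greater than ENDDATE, use the time remaining until ENDDATE in the calculation.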
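The following should do that. The results in newVariable differ slightly from those in your desired output: for ID = 2 and year = 2013.2, no lsr row with ID = 2 has a period from eksd to ENDDATE that contains year, so instead of 120 I get 0.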
adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013.1", "2013.1", "2013.1", "2013.2", "2013.2", "2013.2"))
library(dplyr)
library(tidyr)
names(adherence)[1] <- "ID"
names(adherence)[2] <- "year"
adherence$year <- as.numeric(as.character(adherence$year))
lsr <- cbind.data.frame(
c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
c("2012.3", "2012.8", "2013.1","2012.8", "2013.3", "2013.9", "2011", "2013"), #eksd
c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"
lsr$eksd <- as.numeric(as.character(lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD/365.25
adherence %>% full_join(lsr, by = 'ID') %>% mutate(newVariable = ifelse(eksd <= year & year <= ENDDATE, (ENDDATE - year) * 365.25, 0)) %>%
group_by(ID, year) %>% summarize(newVariable = sum(newVariable))
ID year newVariable
<fctr> <dbl> <dbl>
1 1 2013.1 90.000
2 1 2013.2 53.475
3 2 2013.1 0.000
4 2 2013.2 0.000
5 3 2013.1 53.475
6 3 2013.2 16.950
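Note that the summarized result is ordered by ID and year, not by the original row order of adherence. If the values are needed as a column on adherence itself, one possible approach is to join the totals back. The following is a sketch under some assumptions: the names `totals` and `adherence_out` are made up for illustration, the data is rebuilt with `data.frame()` so the block runs on its own, and `.groups = "drop"` assumes dplyr 1.0.0 or later. Recent dplyr versions may also warn about a many-to-many join here, which is expected for this data.

```r
library(dplyr)

# Same data as above, rebuilt so this block is self-contained.
adherence <- data.frame(
  ID = c("1", "2", "3", "1", "2", "3"),
  year = c(2013.1, 2013.1, 2013.1, 2013.2, 2013.2, 2013.2),
  stringsAsFactors = FALSE
)
lsr <- data.frame(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = c(2012.3, 2012.8, 2013.1, 2012.8, 2013.3, 2013.9, 2011, 2013),
  DDD = c(60, 90, 90, 60, 120, 60, 30, 90),
  stringsAsFactors = FALSE
)
lsr$ENDDATE <- lsr$eksd + lsr$DDD / 365.25

# One total per (ID, year) pair, computed as in the pipeline above.
totals <- adherence %>%
  left_join(lsr, by = "ID") %>%
  mutate(newVariable = ifelse(eksd <= year & year <= ENDDATE,
                              (ENDDATE - year) * 365.25, 0)) %>%
  group_by(ID, year) %>%
  summarize(newVariable = sum(newVariable), .groups = "drop")

# Attach the totals to adherence; left_join keeps the left table's row order.
adherence_out <- left_join(adherence, totals, by = c("ID", "year"))
adherence_out
```

Joining on the numeric year column works here only because both sides hold exactly the same values; with computed years a rounded or integer key would be safer.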
Answer 1 (score: 0)
OK. The solution above takes a lot of resources to run on large datasets, so I ended up with this solution:
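function.AH <- function(x) {
  # apply() hands each row over as a character vector, so convert year back to numeric
  year <- as.numeric(x[2])
  # lsr rows for this ID whose supply period covers the year
  A <- subset(lsr, ID == x[1] & eksd <= year & ENDDATE > year)
  if (nrow(A) == 0) {
    0
  } else {
    sum(A$ENDDATE - year) * 365.25
  }
}
apply(adherence, 1, FUN = function.AH)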
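Because apply() first converts the data frame to a matrix, a data frame with a non-numeric ID column reaches the function as character strings. A possible alternative that keeps each column in its own type is mapply() over the two columns. This is only a sketch: the helper name `remaining_days` is hypothetical, and the data is rebuilt with `data.frame()` so the block is self-contained.

```r
# Same data as above, rebuilt so this block runs on its own.
adherence <- data.frame(
  ID = c("1", "2", "3", "1", "2", "3"),
  year = c(2013.1, 2013.1, 2013.1, 2013.2, 2013.2, 2013.2),
  stringsAsFactors = FALSE
)
lsr <- data.frame(
  ID = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = c(2012.3, 2012.8, 2013.1, 2012.8, 2013.3, 2013.9, 2011, 2013),
  DDD = c(60, 90, 90, 60, 120, 60, 30, 90),
  stringsAsFactors = FALSE
)
lsr$ENDDATE <- lsr$eksd + lsr$DDD / 365.25

# Hypothetical helper: days of supply left for one ID at one point in time.
remaining_days <- function(id, yr) {
  hit <- lsr$ID == id & lsr$eksd <= yr & lsr$ENDDATE > yr
  # sum() over an empty selection is 0, so no separate empty-case branch is needed
  sum(lsr$ENDDATE[hit] - yr) * 365.25
}

# Call the helper once per adherence row; year stays numeric throughout.
adherence$newVariable <- mapply(remaining_days, adherence$ID, adherence$year,
                                USE.NAMES = FALSE)
adherence
```

The filter uses the same conditions as function.AH (eksd <= year and ENDDATE > year), so the values should match that function's output for this data.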