Create a new column by running a function on each row using inputs from multiple datasets

Time: 2017-11-16 14:37:18

Tags: r function merge

I want to aggregate some data by combining two datasets, lsr and adherence (reproducible example):

adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013.1", "2013.1", "2013.1", "2013.2", "2013.2", "2013.2"))
library(dplyr)
library(tidyr)
names(adherence)[1] <- "ID" 
names(adherence)[2] <- "year"
adherence$year <- as.numeric(as.character(adherence$year))

lsr <- cbind.data.frame(
        c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
        c("2012.3", "2012.8", "2013.1","2012.8", "2013.3", "2013.9", "2011", "2013"), #eksd
        c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
        )
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"

lsr$eksd <- as.numeric(as.character(lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD/365.25

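So, in the adherence dataset, I want a new column containing information from the lsr dataset, depending on the ID and year variables of each row in adherence. I wrote this function, which I think does that: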

function.DDAV <- function() {
        Y <- lsr %>% #dummy variable
          filter(., .$ID == adherence$ID) %>% #filters lsr by ID from relevant row in adherence dataset
          filter(., .$eksd <= adherence$year & adherence$year <= .$ENDDATE) # filters further to include relevant time period
        Y$DIFF <- (Y$ENDDATE - adherence$year)*365.25 # calculates the drug supply still available at time = adherence$year, i.e. compares the time from the adherence dataset with the time until the dose runs out
        z <- sum(Y$DIFF) # sum of available drug doses
        return(z) # returns the sum to be put in one row of the new column in the adherence dataset
      }

I want to apply this once to every row of the adherence dataset, creating a new column from the relevant row values.

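I have tried the apply family of functions and aggregate. I have been struggling for a few days and got various errors, which I have not shown because I suspect I may be approaching the problem in entirely the wrong way. When searching, I could only find questions about applying a function over whole rows or columns within a single dataset.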

Edit: the ideal output is shown below:

adherence$adherence <- as.numeric(c("90", "0", "53.475", "16.95", "120", "0"))

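In essence, the calculation after filtering is simple: if a row remains after filtering, the value is calculated from it. Start with the first row, where adherence$ID = 1 and adherence$year = 2013.1. Only the last lsr row with ID = 1 remains, and the sum is (2013.346 (lsr$ENDDATE) - 2013.1 (adherence$year)) * 365.25 ≈ 90.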
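As a quick check of that arithmetic, here is a minimal sketch using only the numbers from the example. The variable names `eksd`, `ddd`, `year`, `enddate` and `remaining` are local illustrations, not columns of the datasets.

```r
# Values for the single lsr row that survives filtering for ID = 1, year = 2013.1
eksd <- 2013.1   # dispensing date in decimal years
ddd  <- 90       # doses dispensed
year <- 2013.1   # adherence$year for the row being evaluated

enddate   <- eksd + ddd / 365.25         # same formula as lsr$ENDDATE
remaining <- (enddate - year) * 365.25   # days of supply left at `year`
remaining
```

The result is 90 up to floating-point rounding; the 2013.346 quoted above is ENDDATE rounded to three decimals.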

2 answers:

Answer 0: (score: 1)

I am not 100% sure this is what you are after, but hopefully it is a starting point.

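My understanding is that, for each occurrence of ID in adherence, you want to compare it against all rows with the same ID in lsr, and wherever year is greater than or equal to eksd and less than or equal to ENDDATE, you want to do a calculation using the time remaining until ENDDATE.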

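The following should do that. The results in newVariable differ slightly from those in your ideal output: for ID = 2 and year = 2013.2, no lsr row with ID = 2 has an interval from eksd to ENDDATE that contains year, so I get 0 rather than 120.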

adherence <- cbind.data.frame(c("1", "2", "3", "1", "2", "3"), c("2013.1", "2013.1", "2013.1", "2013.2", "2013.2", "2013.2"))
library(dplyr)
library(tidyr)
names(adherence)[1] <- "ID" 
names(adherence)[2] <- "year"
adherence$year <- as.numeric(as.character(adherence$year))

lsr <- cbind.data.frame(
     c("1", "1", "1", "2", "2", "2", "3", "3"), #ID
     c("2012.3", "2012.8", "2013.1","2012.8", "2013.3", "2013.9", "2011", "2013"), #eksd
     c("60", "90", "90", "60", "120", "60", "30", "90") # DDD
)
names(lsr)[1] <- "ID"
names(lsr)[2] <- "eksd"
names(lsr)[3] <- "DDD"

lsr$eksd <- as.numeric(as.character(lsr$eksd))
lsr$DDD <- as.numeric(as.character(lsr$DDD))
lsr$ENDDATE <- lsr$eksd + lsr$DDD/365.25

adherence %>%
     full_join(lsr, by = 'ID') %>%
     mutate(newVariable = ifelse(eksd <= year & year <= ENDDATE, (ENDDATE - year) * 365.25, 0)) %>%
     group_by(ID, year) %>%
     summarize(newVariable = sum(newVariable))

      ID   year newVariable
  <fctr>  <dbl>       <dbl>
1      1 2013.1      90.000
2      1 2013.2      53.475
3      2 2013.1       0.000
4      2 2013.2       0.000
5      3 2013.1      53.475
6      3 2013.2      16.950
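To see why ID = 2 yields 0 at year = 2013.2, it may help to look at the supply intervals directly. This is a minimal base R sketch that rebuilds lsr in compact form so it runs on its own; `id2` and `covers` are names introduced here for illustration.

```r
# Compact rebuild of the lsr example data
lsr <- data.frame(
  ID   = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = c(2012.3, 2012.8, 2013.1, 2012.8, 2013.3, 2013.9, 2011, 2013),
  DDD  = c(60, 90, 90, 60, 120, 60, 30, 90)
)
lsr$ENDDATE <- lsr$eksd + lsr$DDD / 365.25

year <- 2013.2
id2  <- lsr[lsr$ID == "2", ]
# TRUE where the interval from eksd to ENDDATE contains `year`
id2$covers <- id2$eksd <= year & year <= id2$ENDDATE
id2
```

None of the three rows covers 2013.2: the first supply ends before it (ENDDATE is about 2012.964) and the other two start after it, so every term in the sum is 0.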

Answer 1: (score: 0)

OK. The solution above used a lot of resources on large datasets, so I ended up with this solution:

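function.AH <- function(x) {
  # apply() passes each row as a character vector, so convert year back to numeric
  id <- x[1]
  yr <- as.numeric(x[2])
  # lsr rows for this ID whose supply interval covers yr
  A <- subset(lsr, ID == id & eksd <= yr & ENDDATE > yr)
  if (nrow(A) == 0) {
    0
  } else {
    # remaining supply in days, summed over the matching rows
    sum(A$ENDDATE - yr) * 365.25
  }
}

apply(adherence, 1, FUN = function.AH)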
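A possible variation, sketched here as an assumption rather than as part of the original answer: passing ID and year as separate arguments through `mapply` avoids the character conversion that `apply` performs on data frame rows, and the result can be stored directly as a new column. The helper name `supply_left` is hypothetical, and the data are rebuilt in compact form so the block runs on its own.

```r
# Compact rebuild of the example data
adherence <- data.frame(
  ID   = c("1", "2", "3", "1", "2", "3"),
  year = c(2013.1, 2013.1, 2013.1, 2013.2, 2013.2, 2013.2)
)
lsr <- data.frame(
  ID   = c("1", "1", "1", "2", "2", "2", "3", "3"),
  eksd = c(2012.3, 2012.8, 2013.1, 2012.8, 2013.3, 2013.9, 2011, 2013),
  DDD  = c(60, 90, 90, 60, 120, 60, 30, 90)
)
lsr$ENDDATE <- lsr$eksd + lsr$DDD / 365.25

# Hypothetical helper: days of supply left for one ID at one point in time
supply_left <- function(id, yr) {
  A <- lsr[lsr$ID == id & lsr$eksd <= yr & lsr$ENDDATE > yr, ]
  sum(A$ENDDATE - yr) * 365.25   # sum over zero rows is 0
}

# One call per adherence row, with ID and year kept in their own types
adherence$newVariable <- mapply(supply_left, adherence$ID, adherence$year,
                                USE.NAMES = FALSE)
adherence
```

In the original row order this gives 90, 0, 53.475, 53.475, 0 and 16.95, which agrees with the newVariable values in the first answer's output.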