Is there a way to turn this code into an R function?

Date: 2017-06-15 15:28:00

Tags: r function

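I'm not very proficient in R, and lately I've been trying to learn how to write functions well. I have a piece of code that, written without functions, would end up being over a thousand lines long. The catch is that it really contains only about six lines of unique code, but it runs on different subsets of a large dataset.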

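library(dplyr)  # provides %>%, group_by() and summarise()

# Pick one FileName / Category / Case combination
df <- subset(data, FileName == "File Name" & Category == "Category Name" & Case == "Case Name")
# Average the replicates within each TestNum
df <- df %>% group_by(TestNum) %>% summarise(FileName = FileName[1], Version = Version[1], Category = Category[1], RepMean = mean(Value), Case = Case[1])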
df <- df[c(2, 3, 4, 5, 1, 6)]
df$Sigma1 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + sd(df$RepMean, na.rm=TRUE))|(df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - sd(df$RepMean, na.rm=TRUE))
df$Sigma2 <- (df$RepMean > (mean(df$RepMean,na.rm=TRUE)) + 2 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 2 * (sd(df$RepMean, na.rm=TRUE)))
df$Sigma3 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + 3 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 3 * (sd(df$RepMean, na.rm=TRUE)))

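The original dataset has 6 unique values in the FileName column, 7 in the Category column, and 4 in the Case column, which means I create 168 unique df data frames with these lines of code. I then use rbind.fill to combine them into a single data frame ("StatTable"), and run this on it: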

LatestTestNum <- max(data$TestNum, na.rm=TRUE)
ControlTable <- subset(StatTable, (Sigma1 == "TRUE" | Sigma2 == "TRUE" | Sigma3 == "TRUE") & TestNum == LatestTestNum)
ControlTable <- ControlTable[, c("FileName", "Category", "Case", "Sigma1", "Sigma2", "Sigma3")]

ControlTable is the final product I'm looking for.

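Is this something a function could make far less tedious? Especially since, whenever I want to change how it works, I have to edit the code for every df by hand.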

Edit: Here is an explanation of what each column of the original dataset contains.

Column1:FileName -- The name of the file that the data comes from
Column2:Version -- The version of the software that data comes from
Column3:Category -- The particular data type measured
Column4:Value -- The value of the data
Column5:TestNum -- TestNum is an integer value associated with the version number. This makes it easier to organize and sort data rather than using the Version column which is a string. (So for example version 1.0 might be TestNum=1 and 1.1 TestNum=2)
Column6:RepNum -- The replication count of that version. (Files are run multiple times per version)
Column7:Case -- There are different ways that the software is "setup" for data collection.

Here is a working example dataset.

FileName <- c("File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2")
Version <- c("1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2")
Category <- c("Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2")
Value <- rpois(n = 32, lambda = 100)
TestNum <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
RepNum <- c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2)
Case <- c("Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2")
df <- data.frame(FileName,Version,Category,Value,TestNum,RepNum,Case)

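Second edit: I've posted an answer of my own, because I worked out a function that performs the basic steps producing each "df" with a unique "FileName", "Category", and "Case". I would still like a way to handle the 168 different data frames, but the main feature I'd like to add to this function is the ability to easily filter out certain TestNums.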

For example, one of my unique data frames works best with this subset:

df <- subset(data, FileName == "File1" & Category == "Category1" & Case == "Case1" &
TestNum > 11)

But another data frame might work best with this subset:

df <- subset(data, FileName == "File1" & Category == "Category1" & Case == "Case1" &
(TestNum > 8 & TestNum != 21 & TestNum != 32))

I figure I should be able to add "TestNum" as another argument to my function, but I'm not sure how to control how much I can filter.

A fun extra challenge:

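Because I'm working with a large body of existing data, I adjust each sigma value so that certain data points are filtered out of the mean and standard deviation calculations (this is necessary in practice for detecting new data points that fall outside these sigma values, which is the whole purpose of this code). Is there a way to write a function that can make the same adjustments?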

2 Answers:

Answer 0: (score: 1)

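For the 168 unique FileName / Category / Case combinations, handling this with dplyr seems entirely natural. First group by FileName / Category / Case / TestNum to get your RepMeans, then group by FileName / Category / Case and compute whether each value is 1, 2, or 3 SDs away from the group mean. Instead of your comparison code, here I first compute the number of SDs and then use that, which feels more natural and repeats fewer calculations.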

df %>% group_by(FileName, Category, Case, TestNum) %>%
       summarise(RepMean = mean(Value)) %>%
       group_by(FileName, Category, Case) %>%
       mutate(diff.sd = abs((RepMean - mean(RepMean, na.rm=TRUE))/sd(RepMean, na.rm=TRUE)),
              Sigma1 = diff.sd > 1,
              Sigma2 = diff.sd > 2,
              Sigma3 = diff.sd > 3)

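For your other subsets, I think the most natural approach is simply to remove the rows you don't want, rather than to include the rows you do. Once you've removed them from the full dataset, you can run this code.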
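One way to remove unwanted rows up front is to keep them in a small lookup table and drop them with an anti-join. The following is only a sketch under assumptions: the `toy` data and the `exclusions` table are made up for illustration and are not part of the original question.

```r
library(dplyr)

# Toy data with the same key columns as the question (values are invented).
toy <- data.frame(
  FileName = rep(c("File1", "File2"), each = 5),
  Category = "Category1",
  Case     = "Case1",
  TestNum  = rep(1:5, times = 2),
  Value    = c(10, 11, 12, 13, 14, 20, 21, 22, 23, 24),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per (combination, TestNum) to drop.
exclusions <- data.frame(
  FileName = c("File1", "File1", "File2"),
  Category = "Category1",
  Case     = "Case1",
  TestNum  = c(1L, 2L, 4L),
  stringsAsFactors = FALSE
)

# anti_join() keeps only the rows of `toy` with no match in `exclusions`.
cleaned <- anti_join(toy, exclusions,
                     by = c("FileName", "Category", "Case", "TestNum"))
```

Keeping the exclusions as data rather than as hand-written `subset()` conditions means each combination can have its own list of dropped TestNums without changing the pipeline itself.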

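Edit to demonstrate the output: here I show output with sigma values of 1+, 2+, and 3+ by adding outliers to the original data. I also adapt the data so that each group has more data points and only one row per TestNum. In this version I also output the group mean, SD, and size, to make sure they all work correctly.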

df$Case <- "Case1"
df$Category <- "Category1"
df$TestNum <- 1:nrow(df)
df$Value[1] <- 5000
df$Value[5] <- 140
out <- df %>% group_by(FileName, Category, Case, TestNum) %>%
       summarise(RepMean = mean(Value)) %>%
       group_by(FileName, Category, Case) %>%
           mutate(group.mean=mean(RepMean, na.rm=TRUE),
                  group.sd=sd(RepMean, na.rm=TRUE),
                  group.n=length(RepMean),
                  diff.sd = abs((RepMean - mean(RepMean, na.rm=TRUE))/sd(RepMean, na.rm=TRUE)),
              Sigma1 = diff.sd > 1,
              Sigma2 = diff.sd > 2,
              Sigma3 = diff.sd > 3)
head(out[order(-out$diff.sd),])
## Source: local data frame [6 x 12]
## Groups: FileName, Category, Case [2]
##  
##   FileName  Category  Case TestNum RepMean group.mean   group.sd group.n   diff.sd Sigma1 Sigma2 Sigma3
##     <fctr>     <chr> <chr>   <int>   <dbl>      <dbl>      <dbl>   <int>     <dbl>  <lgl>  <lgl>  <lgl>
## 1    File1 Category1 Case1       1    5000   402.4375 1226.06857      16 3.7498413   TRUE   TRUE   TRUE
## 2    File2 Category1 Case1       5     140   103.5625   13.29646      16 2.7403912   TRUE   TRUE  FALSE
## 3    File2 Category1 Case1      13      85   103.5625   13.29646      16 1.3960483   TRUE  FALSE  FALSE
## 4    File2 Category1 Case1      15     118   103.5625   13.29646      16 1.0858154   TRUE  FALSE  FALSE
## 5    File2 Category1 Case1      16      90   103.5625   13.29646      16 1.0200084   TRUE  FALSE  FALSE
## 6    File2 Category1 Case1      14      91   103.5625   13.29646      16 0.9448004  FALSE  FALSE  FALSE
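The question's extra challenge (computing the mean and SD from a trusted baseline, then testing newer points against it) could plausibly be folded into the same pipeline. The sketch below is an assumption-laden illustration, not the answerer's method: `flag_sigma`, its `baseline` argument, the `in_base`, `base.mean` and `base.sd` columns, and the `toy` data are all hypothetical names.

```r
library(dplyr)

# `baseline` is a hypothetical predicate on TestNum: rows where it returns
# TRUE contribute to the mean and SD; every row is still compared to them.
flag_sigma <- function(data, baseline = function(test_num) rep(TRUE, length(test_num))) {
  data %>%
    group_by(FileName, Category, Case, TestNum) %>%
    summarise(RepMean = mean(Value), .groups = "drop") %>%
    group_by(FileName, Category, Case) %>%
    mutate(in_base   = baseline(TestNum),
           base.mean = mean(RepMean[in_base], na.rm = TRUE),
           base.sd   = sd(RepMean[in_base], na.rm = TRUE),
           diff.sd   = abs(RepMean - base.mean) / base.sd,
           Sigma1 = diff.sd > 1,
           Sigma2 = diff.sd > 2,
           Sigma3 = diff.sd > 3) %>%
    ungroup()
}

# Toy data: five stable TestNums and a jump at TestNum 6 (invented values).
toy <- data.frame(
  FileName = "File1", Category = "Category1", Case = "Case1",
  TestNum  = rep(1:6, each = 2),
  Value    = c(99, 101, 101, 103, 97, 99, 100, 102, 98, 100, 149, 151),
  stringsAsFactors = FALSE
)

# Use TestNums before the latest one as the baseline.
latest    <- max(toy$TestNum, na.rm = TRUE)
StatTable <- flag_sigma(toy, baseline = function(test_num) test_num < latest)

# Same final step as the question: flagged rows for the latest TestNum.
ControlTable <- StatTable %>%
  filter(TestNum == latest, Sigma1 | Sigma2 | Sigma3) %>%
  select(FileName, Category, Case, Sigma1, Sigma2, Sigma3)
```

Excluding the newest point from the baseline keeps a large jump from inflating the SD it is being tested against. With the default `baseline`, the function reduces to the pipeline shown earlier in this answer.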

Answer 1: (score: 0)

Here is a simple function that I've worked out:

myFunction <- function(df,FileNameStr,CategoryStr,CaseStr){
    df <- subset(df, FileName == FileNameStr & Category == CategoryStr & Case == CaseStr)
    df <- df %>% group_by(TestNum) %>% summarise(FileName = FileName[1], Version = Version[1], Category = Category[1], RepMean = mean(Value), Case = Case[1])
    df <- df[c(2, 3, 4, 5, 1, 6)]
    df$Sigma1 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + sd(df$RepMean, na.rm=TRUE))|(df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - sd(df$RepMean, na.rm=TRUE))
    df$Sigma2 <- (df$RepMean > (mean(df$RepMean,na.rm=TRUE)) + 2 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 2 * (sd(df$RepMean, na.rm=TRUE)))
    df$Sigma3 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + 3 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 3 * (sd(df$RepMean, na.rm=TRUE)))
    return(df)
}

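Ideally, I'd like to work the rbind.fill bit in somehow. I feel there must be a faster way than manually typing 168 data frame names into the function. Still, this is the function that creates each data frame.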
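One possible way to avoid typing every combination by hand is to split the data on the three key columns, apply a per-subset function to each piece, and bind the results. This is a base-R sketch under assumptions: `sigma_one` is a hypothetical stand-in that mirrors the steps of `myFunction` for a single subset, and `toy` is invented data.

```r
# Hypothetical per-subset helper: RepMean per TestNum, then sigma flags.
sigma_one <- function(sub) {
  out <- aggregate(Value ~ TestNum, data = sub, FUN = mean)
  names(out)[names(out) == "Value"] <- "RepMean"
  out$FileName <- sub$FileName[1]
  out$Category <- sub$Category[1]
  out$Case     <- sub$Case[1]
  # Number of SDs each RepMean lies from the mean of this subset
  z <- abs(out$RepMean - mean(out$RepMean)) / sd(out$RepMean)
  out$Sigma1 <- z > 1
  out$Sigma2 <- z > 2
  out$Sigma3 <- z > 3
  out
}

# Toy data: 2 files x 2 cases x 4 TestNums x 2 replicates (invented values).
toy <- expand.grid(RepNum = 1:2, TestNum = 1:4,
                   FileName = c("File1", "File2"),
                   Case = c("Case1", "Case2"),
                   stringsAsFactors = FALSE)
toy$Category <- "Category1"
toy$Value    <- seq_len(nrow(toy))

# One piece per FileName / Category / Case combination present in the data.
pieces <- split(toy, list(toy$FileName, toy$Category, toy$Case), drop = TRUE)

# Apply the helper to every piece and stack the results into one table.
StatTable <- do.call(rbind, lapply(pieces, sigma_one))
rownames(StatTable) <- NULL
```

Because every piece comes back with the same columns, plain `rbind` is enough here; `rbind.fill` would only be needed if the columns differed between pieces.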

Edit: I've edited my original question (under "Second edit") to explain that I need to be able to add filtering to this function.
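A flexible way to add that filtering might be to pass the TestNum condition in as a function, so each call can filter as much or as little as it needs. This is a sketch only: `filter_subset` and its `keep` argument are hypothetical names, and the same idea could replace the `subset()` line at the top of `myFunction`.

```r
# `keep` is a hypothetical predicate: it receives the TestNum column and
# returns TRUE for rows to retain. The default keeps every TestNum.
filter_subset <- function(df, FileNameStr, CategoryStr, CaseStr,
                          keep = function(test_num) rep(TRUE, length(test_num))) {
  rows <- df$FileName == FileNameStr &
          df$Category == CategoryStr &
          df$Case == CaseStr &
          keep(df$TestNum)
  # which() drops NA comparisons instead of returning all-NA rows
  df[which(rows), , drop = FALSE]
}

# Toy data: a single combination with TestNum 1 to 40 (invented values).
toy <- data.frame(FileName = "File1", Category = "Category1", Case = "Case1",
                  TestNum = 1:40, Value = 100 + (1:40),
                  stringsAsFactors = FALSE)

# The two filters from the question, expressed as predicates.
a <- filter_subset(toy, "File1", "Category1", "Case1",
                   keep = function(test_num) test_num > 11)
b <- filter_subset(toy, "File1", "Category1", "Case1",
                   keep = function(test_num) test_num > 8 & !(test_num %in% c(21, 32)))
```

Passing a predicate avoids having to anticipate every kind of condition with separate arguments such as a minimum TestNum and a list of excluded values.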