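I'm not very good at R, and lately I've been trying to learn how to write functions well. I have a piece of code that, written without functions, would end up being over a thousand lines long. The thing is, it really only contains about six lines of unique code, but those lines are run on different subsets of a large dataset.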
df <- subset(data, FileName == "File Name" & Category == "Category Name" & Case == "Case Name")
df <- df %>% group_by(TestNum) %>% summarise(FileName = FileName[1], Version = Version[1], Category = Category[1], RepMean = mean(Value), Case = Case[1])
df <- df[c(2, 3, 4, 5, 1, 6)]
df$Sigma1 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + sd(df$RepMean, na.rm=TRUE))|(df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - sd(df$RepMean, na.rm=TRUE))
df$Sigma2 <- (df$RepMean > (mean(df$RepMean,na.rm=TRUE)) + 2 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 2 * (sd(df$RepMean, na.rm=TRUE)))
df$Sigma3 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + 3 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 3 * (sd(df$RepMean, na.rm=TRUE)))
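The original dataset has 6 unique values in the FileName column, 7 in the Category column, and 4 in the Case column, which means I create 168 unique df data frames with these lines of code. I then use rbind.fill to combine them into a single data frame ("StatTable"), which I run through this: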
LatestTestNum <- max(data$TestNum, na.rm=TRUE)
ControlTable <- subset(StatTable, (Sigma1 == "TRUE" | Sigma2 == "TRUE" | Sigma3 == "TRUE") & TestNum == LatestTestNum)
ControlTable <- ControlTable[, c("FileName", "Category", "Case", "Sigma1", "Sigma2", "Sigma3")]
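ControlTable is the final product I'm looking for.
Is this the kind of thing where a function would greatly reduce the tedium? Especially because, whenever I want to change how it works, I have to edit the code for every df by hand.
Edit: Here is an explanation of what each column of the original dataset contains.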
Column1:FileName -- The name of the file that the data comes from
Column2:Version -- The version of the software that data comes from
Column3:Category -- The particular data type measured
Column4:Value -- The value of the data
Column5:TestNum -- TestNum is an integer value associated with the version number. This makes it easier to organize and sort data rather than using the Version column which is a string. (So for example version 1.0 might be TestNum=1 and 1.1 TestNum=2)
Column6:RepNum -- The replication count of that version. (Files are run multiple times per version)
Column7:Case -- There are different ways that the software is "setup" for data collection.
Here is a working dataset.
FileName <- c("File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2","File1","File1","File1","File1","File2","File2","File2","File2")
Version <- c("1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.1","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2","1.0.2")
Category <- c("Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2","Category1","Category1","Category2","Category2")
Value <- rpois(n = 32, lambda = 100)
TestNum <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2)
RepNum <- c(1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2)
Case <- c("Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case1","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2","Case2")
df <- data.frame(FileName,Version,Category,Value,TestNum,RepNum,Case)
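Second edit: I have posted an answer of my own, because I worked out a function that performs the basic steps that produce each "df" with a unique "FileName", "Category", and "Case". I would still like to deal with the need for 168 separate data frames, but the main feature I want to add to this function is the ability to filter out certain TestNums easily.
For example, one of my unique data frames is best served by this subset: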
df <- subset(data, FileName == "File1" & Category == "Category1" & Case == "Case1" &
TestNum > 11)
But another data frame might be best served by this subset:
df <- subset(data, FileName == "File1" & Category == "Category1" & Case == "Case1" &
(TestNum > 8 & TestNum != 21 & TestNum != 32))
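I think I should be able to add "TestNum" as another argument to my function, but I'm not sure how to control how much I can filter.
Because I'm working with a large amount of existing data, I tune each sigma value by filtering certain data points out of the mean and standard deviation calculations (this is necessary for actually detecting new data points that fall outside these sigma values, which is the whole purpose of this code). Is there a way to write a function that can make the same adjustments?
Answer 0 (score: 1)
With 168 unique FileName / Category / Case combinations, doing this the dplyr way seems completely natural. First group by FileName / Category / Case / TestNum to get your RepMeans, then group by FileName / Category / Case and calculate whether each RepMean is 1, 2, or 3 SDs away from the group mean. Instead of your comparison code, here I first compute the number of SDs and then compare against that, which feels more natural and repeats fewer calculations.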
library(dplyr)
df %>% group_by(FileName, Category, Case, TestNum) %>%
  summarise(RepMean = mean(Value)) %>%
  group_by(FileName, Category, Case) %>%
  mutate(diff.sd = abs((RepMean - mean(RepMean, na.rm=TRUE))/sd(RepMean, na.rm=TRUE)),
         Sigma1 = diff.sd > 1,
         Sigma2 = diff.sd > 2,
         Sigma3 = diff.sd > 3)
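For your additional subsetting, I think the most natural approach is simply to remove the rows you don't want, rather than to include the rows you do want. Once you have removed them from the full dataset, you can run this code.
Edit to demonstrate output: here I add outliers to the original data so that the output shows sigma values of 1+, 2+, and 3+, and I also adapt the data so that each group has more data points, with only one per TestNum. In this version I also output the group mean, sd, and size, to check that everything works properly.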
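As a rough sketch of the "remove the rows you don't want" idea: the per-combination rules could live in small lookup tables that are joined against the full data. The tables `min_test` and `bad_tests`, the column `MinTestNum`, and the toy data below are all made up for illustration; only the column names FileName, Category, Case, TestNum, RepNum, and Value come from the question.

```r
library(dplyr)

# Toy data with the question's column names (values are made up).
set.seed(1)
data <- expand.grid(
  FileName = c("File1", "File2"),
  Category = "Category1",
  Case     = "Case1",
  TestNum  = 1:12,
  RepNum   = 1:2,
  stringsAsFactors = FALSE
)
data$Value <- rpois(nrow(data), lambda = 100)

# Hypothetical rule tables: one row per combination that needs a rule.
min_test <- data.frame(
  FileName = "File1", Category = "Category1", Case = "Case1",
  MinTestNum = 8,              # keep only TestNum > 8 for this combination
  stringsAsFactors = FALSE
)
bad_tests <- data.frame(
  FileName = "File1", Category = "Category1", Case = "Case1",
  TestNum = c(10L, 11L),       # individual TestNums to drop
  stringsAsFactors = FALSE
)

filtered <- data %>%
  left_join(min_test, by = c("FileName", "Category", "Case")) %>%
  # Combinations without a rule have MinTestNum = NA and keep every row.
  filter(is.na(MinTestNum) | TestNum > MinTestNum) %>%
  select(-MinTestNum) %>%
  # Drop the explicitly listed FileName / Category / Case / TestNum rows.
  anti_join(bad_tests, by = c("FileName", "Category", "Case", "TestNum"))
```

After this, `filtered` can be passed through the grouped pipeline in place of the full dataset. Keeping the rules in tables means a rule can be changed without touching the code.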
df$Case <- "Case1"
df$Category <- "Category1"
df$TestNum <- 1:nrow(df)
df$Value[1] <- 5000
df$Value[5] <- 140
out <- df %>% group_by(FileName, Category, Case, TestNum) %>%
  summarise(RepMean = mean(Value)) %>%
  group_by(FileName, Category, Case) %>%
  mutate(group.mean=mean(RepMean, na.rm=TRUE),
         group.sd=sd(RepMean, na.rm=TRUE),
         group.n=length(RepMean),
         diff.sd = abs((RepMean - mean(RepMean, na.rm=TRUE))/sd(RepMean, na.rm=TRUE)),
         Sigma1 = diff.sd > 1,
         Sigma2 = diff.sd > 2,
         Sigma3 = diff.sd > 3)
head(out[order(-out$diff.sd),])
## Source: local data frame [6 x 12]
## Groups: FileName, Category, Case [2]
##
##   FileName  Category  Case TestNum RepMean group.mean   group.sd group.n   diff.sd Sigma1 Sigma2 Sigma3
##     <fctr>     <chr> <chr>   <int>   <dbl>      <dbl>      <dbl>   <int>     <dbl>  <lgl>  <lgl>  <lgl>
## 1    File1 Category1 Case1       1    5000   402.4375 1226.06857      16 3.7498413   TRUE   TRUE   TRUE
## 2    File2 Category1 Case1       5     140   103.5625   13.29646      16 2.7403912   TRUE   TRUE  FALSE
## 3    File2 Category1 Case1      13      85   103.5625   13.29646      16 1.3960483   TRUE  FALSE  FALSE
## 4    File2 Category1 Case1      15     118   103.5625   13.29646      16 1.0858154   TRUE  FALSE  FALSE
## 5    File2 Category1 Case1      16      90   103.5625   13.29646      16 1.0200084   TRUE  FALSE  FALSE
## 6    File2 Category1 Case1      14      91   103.5625   13.29646      16 0.9448004  FALSE  FALSE  FALSE
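To get from a table like this to the ControlTable described in the question, one possibility is to filter the grouped result down to the latest TestNum. This is only a sketch: the toy data and the planted outlier are made up, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data with the question's column names (values are made up).
set.seed(42)
data <- expand.grid(
  FileName = c("File1", "File2"),
  Category = c("Category1", "Category2"),
  Case     = c("Case1", "Case2"),
  TestNum  = 1:10,
  RepNum   = 1:2,
  stringsAsFactors = FALSE
)
data$Value <- rpois(nrow(data), lambda = 100)

# Plant an outlier in the latest test of one combination.
planted <- data$FileName == "File1" & data$Category == "Category1" &
  data$Case == "Case1" & data$TestNum == 10
data$Value[planted] <- 500

# Same two-stage grouping as above, for every combination at once.
StatTable <- data %>%
  group_by(FileName, Category, Case, TestNum) %>%
  summarise(RepMean = mean(Value), .groups = "drop") %>%
  group_by(FileName, Category, Case) %>%
  mutate(diff.sd = abs(RepMean - mean(RepMean, na.rm = TRUE)) /
           sd(RepMean, na.rm = TRUE),
         Sigma1 = diff.sd > 1,
         Sigma2 = diff.sd > 2,
         Sigma3 = diff.sd > 3) %>%
  ungroup()

# Keep only flagged rows from the most recent test.
LatestTestNum <- max(data$TestNum, na.rm = TRUE)
ControlTable <- StatTable %>%
  filter(TestNum == LatestTestNum, Sigma1 | Sigma2 | Sigma3) %>%
  select(FileName, Category, Case, Sigma1, Sigma2, Sigma3)
```

Because the Sigma columns are already logical, they can be used directly in `filter()` without comparing them to the string "TRUE".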
Answer 1 (score: 0)
Here is a simple function that I have worked out:
myFunction <- function(df,FileNameStr,CategoryStr,CaseStr){
  df <- subset(df, FileName == FileNameStr & Category == CategoryStr & Case == CaseStr)
  df <- df %>% group_by(TestNum) %>% summarise(FileName = FileName[1], Version = Version[1], Category = Category[1], RepMean = mean(Value), Case = Case[1])
  df <- df[c(2, 3, 4, 5, 1, 6)]
  df$Sigma1 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + sd(df$RepMean, na.rm=TRUE))|(df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - sd(df$RepMean, na.rm=TRUE))
  df$Sigma2 <- (df$RepMean > (mean(df$RepMean,na.rm=TRUE)) + 2 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 2 * (sd(df$RepMean, na.rm=TRUE)))
  df$Sigma3 <- (df$RepMean > (mean(df$RepMean, na.rm=TRUE)) + 3 * (sd(df$RepMean, na.rm=TRUE))) | (df$RepMean < (mean(df$RepMean, na.rm=TRUE)) - 3 * (sd(df$RepMean, na.rm=TRUE)))
  return(df)
}
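Ideally, I'd like to work the rbind.fill bit in somehow. I feel there must be a faster way than manually typing 168 data frame names into the function. Still, this is the function that creates each data frame.
Edit: I have edited my original question (under "Second edit") to explain that I need to be able to add filtering to this function.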
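One possible way to avoid typing the 168 calls by hand is to build a table of the combinations that occur in the data and loop over its rows. In this sketch, `summarise_one` is a compact, hypothetical stand-in with the same arguments as myFunction, the toy data is made up, and `bind_rows` from dplyr is used in place of rbind.fill (both fill missing columns with NA).

```r
library(dplyr)

# Compact stand-in for myFunction (same arguments, same idea).
summarise_one <- function(df, FileNameStr, CategoryStr, CaseStr) {
  sub <- subset(df, FileName == FileNameStr & Category == CategoryStr &
                  Case == CaseStr)
  out <- sub %>%
    group_by(TestNum) %>%
    summarise(FileName = FileName[1], Category = Category[1],
              Case = Case[1], RepMean = mean(Value))
  # Distance from the mean in units of SD.
  z <- abs(out$RepMean - mean(out$RepMean, na.rm = TRUE)) /
    sd(out$RepMean, na.rm = TRUE)
  out$Sigma1 <- z > 1
  out$Sigma2 <- z > 2
  out$Sigma3 <- z > 3
  out
}

# Toy data with the question's column names (values are made up).
set.seed(1)
data <- expand.grid(
  FileName = c("File1", "File2"),
  Category = c("Category1", "Category2"),
  Case     = c("Case1", "Case2"),
  TestNum  = 1:5,
  RepNum   = 1:2,
  stringsAsFactors = FALSE
)
data$Value <- rpois(nrow(data), lambda = 100)

# One row per FileName / Category / Case combination present in the data.
combos <- unique(data[, c("FileName", "Category", "Case")])

# Call the function once per combination and stack the results.
pieces <- lapply(seq_len(nrow(combos)), function(i) {
  summarise_one(data, combos$FileName[i], combos$Category[i], combos$Case[i])
})
StatTable <- bind_rows(pieces)
```

With the real data, `combos` would have up to 168 rows, and changing the calculation only means editing the one function.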
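For the filtering, one option (a sketch, not the only way) is to pass the filter itself as an argument: a function that receives the TestNum vector and returns TRUE for the rows to keep. That way each call decides how much or how little to filter. The name `myFunctionFiltered`, the `keep` argument, and the toy data are made up for illustration. The Sigma columns are computed from the distance in SDs, which gives the same flags as the original comparisons whenever the SD is positive.

```r
library(dplyr)

# Hypothetical variant of myFunction with an optional row filter.
# `keep` takes the TestNum vector and returns TRUE for rows to keep;
# the default keeps everything.
myFunctionFiltered <- function(df, FileNameStr, CategoryStr, CaseStr,
                               keep = function(TestNum) rep(TRUE, length(TestNum))) {
  df <- subset(df, FileName == FileNameStr & Category == CategoryStr &
                 Case == CaseStr)
  # which() drops rows where the filter evaluates to NA.
  df <- df[which(keep(df$TestNum)), ]
  df <- df %>%
    group_by(TestNum) %>%
    summarise(FileName = FileName[1], Category = Category[1],
              Case = Case[1], RepMean = mean(Value))
  z <- abs(df$RepMean - mean(df$RepMean, na.rm = TRUE)) /
    sd(df$RepMean, na.rm = TRUE)
  df$Sigma1 <- z > 1
  df$Sigma2 <- z > 2
  df$Sigma3 <- z > 3
  df
}

# Toy data with the question's column names (values are made up).
set.seed(1)
data <- expand.grid(
  FileName = c("File1", "File2"),
  Category = "Category1",
  Case     = "Case1",
  TestNum  = 1:12,
  RepNum   = 1:2,
  stringsAsFactors = FALSE
)
data$Value <- rpois(nrow(data), lambda = 100)

# No filter: all 12 TestNums are used.
all_rows <- myFunctionFiltered(data, "File1", "Category1", "Case1")

# Custom filter: only TestNum > 8, and not TestNum 10.
res <- myFunctionFiltered(data, "File1", "Category1", "Case1",
                          keep = function(TestNum) TestNum > 8 & TestNum != 10)
```

A filter like the second example in the question could then be written as `keep = function(TestNum) TestNum > 8 & !(TestNum %in% c(21, 32))`.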