In R

Time: 2019-02-12 13:42:44

Tags: r for-loop formatting

I'm very new to R, so I hope this question is still interesting. I wrote a for loop that produces 11 csv files. This is the code I used:

for (i in seq(0, 1, by = 0.1))
{collar$results2<-mutate(collar,results2 = case_when( (probability > i & results1 == "POSITIVE") | (probability < i & results1 == "NEGATIVE") ~ TRUE, TRUE ~ FALSE) )
as.character(collar$results2)
collaraccuracy1=paste('collar41361_41365', i, 'csv', sep = '.')
write.csv(collar,collaraccuracy1)}

As you can see, every file created is named collar41361_41365.i.csv, where i runs from 0 to 1 in steps of 0.1, as shown below:

[1] "collar41361_41365.0.csv"
[1] "collar41361_41365.0.1.csv"
[1] "collar41361_41365.0.2.csv"
[1] "collar41361_41365.0.3.csv"
[1] "collar41361_41365.0.4.csv"
[1] "collar41361_41365.0.5.csv"
[1] "collar41361_41365.0.6.csv"
[1] "collar41361_41365.0.7.csv"
[1] "collar41361_41365.0.8.csv"
[1] "collar41361_41365.0.9.csv"
[1] "collar41361_41365.1.csv"

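Now I would like to format all the files at once, since they have the same structure (10 columns, 240 rows, and the same column headers) and the same naming pattern.

See the code below, which is what I have been trying to run over these 11 files. I used Sys.glob because another post mentioned it as the best way to do this. I previously wrote the code for this operation on a single file, and it worked. I now want to apply it to all 11 files at once: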

#1) Reading multiple files at one. Now, this will only work for the files with a decimal value of i in their name -which is fine-. If I was reading files with i=0 or i=1, then we'll have the pattern "collar41361_41365.*.csv". Am I right?

collaraccuracy<-lapply(Sys.glob("collar41361_41365.***.csv"), read.csv)

#2) Select only the columns with header "observed","predicted","probability","results1","results2.results2"

collaraccuracy<-fread("collar41361_41365.***.csv",select=c("observed","predicted","probability","results1","results2.results2"),stringsAsFactors = F)

#3) Rename column "results2.results2" to "results2"

colnames(collaraccuracy)<-c("observed","predicted","probability","results1","results2")

#4) Create 6th column "results" by merging columns "results1" and "results2"

collaraccuracy$results <- paste(collaraccuracy$results2, 
collaraccuracy$results1,sep="_")


#5) End of the formatting. Write new formated csv files with the pattern "collar41361_by_41365.i.csv"

collaraccuracy2=paste('collar41361_by_41365', i, 'csv', sep = '.')
write.csv(collaraccuracy,collaraccuracy2)

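As you can see, I have 5 different operations to apply for the i values (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9), which should leave me with 9 files.

I am especially concerned about the syntax in steps 1) and 2), but this is the best I have managed so far.

Any tips on how to set this up? Any help is appreciated!

P.S. Update: I have tried writing a function and applying it to the rest of the files with lapply: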

collarcolumns<-function(collaraccuracy1)
{collaraccuracy1<-fread(("collar41361_41365.1.csv"),select=c("observed","predicted","probability","results1","results2.results2"),stringsAsFactors = F)
colnames(collaraccuracy1)<-c("observed","predicted","probability","results1","results2")
collaraccuracy1$results <- paste(collaraccuracy1$results2, collaraccuracy1$results1,sep="_")
collaraccuracy2=paste('collar41361_by_41365', i, 'csv', sep = '.')
write.csv(collaraccuracy1,collaraccuracy2)}

lapply(Sys.glob("collar41361_41365.*.csv"), collarcolumns)

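R printed "NULL" 11 times. Am I on the right track?

1 answer:

Answer 0 (score: 1)

Taking a step back, it sounds like you want to do the following for each i:

  • Add a column results2 that checks whether the predicted value agrees with the observed value at probability cutoff i.
  • Add a column results that concatenates results1 and results2.

The reason you are seeing odd column names such as results2.results2 is that your original for loop is redundant: you don't need both the assignment (collar$results2 <- ...) and the mutate. Combining them stores the entire data frame returned by mutate inside a single column. We can collapse the whole process into a single loop, as follows: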

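library(dplyr)  # provides %>%, mutate() and case_when()

for (i in seq(0, 1, by = 0.1)) {
  # Build both new columns on a copy, leaving `collar` itself unchanged
  collar.temp <- collar %>%
    mutate(results2 = case_when((probability > i & results1 == "POSITIVE") |
                                  (probability < i & results1 == "NEGATIVE") ~ TRUE,
                                TRUE ~ FALSE)) %>%
    mutate(results = paste(results1, results2, sep = "_"))
  # One output file per cutoff, e.g. "collar41361_41365.0.1.csv"
  collaraccuracy1 <- paste('collar41361_41365', i, 'csv', sep = '.')
  write.csv(collar.temp, collaraccuracy1)
}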
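To see where a name like results2.results2 comes from, here is a minimal base-R sketch. The data frame `df` and its columns `x` and `flag` are hypothetical stand-ins, not part of the original data; `transform()` plays the role of `mutate()` so that the example needs no packages.

```r
# Hypothetical toy data frame; not the real `collar` table.
df <- data.frame(x = c(0.2, 0.8, 0.5))

# Same mistake as in the question: the right-hand side returns a whole
# data frame (columns x and flag), and it is stored inside ONE column.
df$flag <- transform(df, flag = x > 0.4)

# Round-trip through a temporary csv file to see the header that gets written.
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
written_names <- names(read.csv(tmp))
print(written_names)
```

The nested data frame is flattened on write, and each of its columns is prefixed with the name of the column that held it, which is how a column called results2 inside a column called results2 ends up as results2.results2.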

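Taking another step back, are you sure you want 11 separate tables? It looks to me like you are effectively checking the accuracy of your predictions at various "confidence" cutoffs. One way to put the data into a tidy format is something like this, where cutoff is its own column: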

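library(dplyr)  # provides %>%, mutate(), case_when() and bind_rows()

# bind_rows() accepts a list of data frames, so do.call() is not needed
collar.tidy <- bind_rows(
  lapply(
    seq(0, 1, by = 0.1),
    function(x) {
      # One copy of `collar` per cutoff, tagged with the cutoff it was scored at
      collar %>%
        mutate(cutoff = x,
               results2 = case_when((probability > x & results1 == "POSITIVE") |
                                      (probability < x & results1 == "NEGATIVE") ~ TRUE,
                                    TRUE ~ FALSE)) %>%
        mutate(results = paste(results1, results2, sep = "_"))
    }
  )
)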
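With the cutoff in its own column, a per-cutoff summary becomes a single grouped operation. The following is only a sketch: the toy `collar` values are hypothetical, only three cutoffs are used to keep it short, and `share_true` (the share of rows where results2 is TRUE) is an assumed definition of "accuracy" that the question does not spell out.

```r
library(dplyr)

# Hypothetical toy data standing in for the real `collar` table.
collar <- data.frame(
  probability = c(0.9, 0.6, 0.4, 0.2),
  results1    = c("POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE")
)

# Same tidy layout as above. With no missing values, the plain logical
# expression gives the same TRUE/FALSE result as the case_when() version.
collar.tidy <- bind_rows(lapply(seq(0, 1, by = 0.5), function(x) {
  collar %>%
    mutate(cutoff = x,
           results2 = (probability > x & results1 == "POSITIVE") |
                      (probability < x & results1 == "NEGATIVE"))
}))

# One row per cutoff: the share of rows flagged TRUE at that cutoff.
accuracy_by_cutoff <- collar.tidy %>%
  group_by(cutoff) %>%
  summarise(share_true = mean(results2))

print(accuracy_by_cutoff)
```

Doing the same thing across 11 separate csv files would mean reading each file back in and combining them first.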

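For a detailed introduction to tidy data, see here. You may think of other ways to tidy this dataset; for example, it is not clear to me that the results column, which concatenates two other columns, is strictly necessary.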
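If you do still want to post-process files that are already on disk, three things in the update are worth noting. write.csv returns NULL invisibly, so a function that ends with write.csv returns NULL and lapply prints a list of 11 NULLs; that output alone does not mean the call failed. However, collarcolumns ignores its argument and always reads collar41361_41365.1.csv, and it builds the output name from an i that is not defined inside the function, so every call writes to the same file. Also, in a glob pattern a single * already matches any run of characters, so "collar41361_41365.*.csv" covers all 11 names. Below is a minimal base-R sketch of one possible fix; `format_collar_file`, `demo_dir` and the toy file contents are hypothetical, and the demo works in a temporary directory rather than on the real files.

```r
# Hypothetical demo setup: two small input files in a temporary directory.
demo_dir <- tempfile("collar_demo")
dir.create(demo_dir)
for (i in c(0.1, 0.2)) {
  toy <- data.frame(observed = c("A", "B"), predicted = c("A", "A"),
                    probability = c(0.9, 0.3),
                    results1 = c("POSITIVE", "NEGATIVE"),
                    results2.results2 = c(TRUE, FALSE),
                    extra = 1:2)
  write.csv(toy,
            file.path(demo_dir, paste("collar41361_41365", i, "csv", sep = ".")),
            row.names = FALSE)
}

# Hypothetical helper: reads the file it is given (not a hard-coded name)
# and derives the output name from the input name.
format_collar_file <- function(path) {
  keep <- c("observed", "predicted", "probability", "results1",
            "results2.results2")
  dat <- read.csv(path, stringsAsFactors = FALSE)[, keep]
  names(dat)[names(dat) == "results2.results2"] <- "results2"
  dat$results <- paste(dat$results2, dat$results1, sep = "_")
  out <- file.path(dirname(path),
                   sub("collar41361_41365", "collar41361_by_41365",
                       basename(path), fixed = TRUE))
  write.csv(dat, out, row.names = FALSE)
  out  # return the output path, so lapply() collects something other than NULL
}

# A single * is enough to match every cutoff in the file names.
inputs  <- Sys.glob(file.path(demo_dir, "collar41361_41365.*.csv"))
outputs <- lapply(inputs, format_collar_file)
print(basename(unlist(outputs)))
```

Returning the output path is a design choice for this sketch; wrapping the lapply call in invisible() would also suppress the printed list if the return value is not needed.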