Question

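OK, to set the scene: I have written a function that imports multiple tables from MySQL (using RODBC) and runs randomForest() on them. The function is run on multiple databases (as separate instances). In one particular database, on one particular table, the error "Error in as.POSIXlt.character(x, tz, ...): character string is not in a standard unambiguous format" is thrown. The function runs on around 150 tables across two databases without any problems, except for this one table.

Here is a head() print of the table: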

              MQLTime bar5 bar4 bar3 bar2 bar1 pat1 baXRC
1 2014-11-05 23:35:00  184   24    8   24   67  147  Flat
2 2014-11-05 23:57:00  203  184  204   67   51  147  Flat
3 2014-11-06 00:40:00  179  309   49  189   75   19  Flat
4 2014-11-06 00:46:00   28  192   60   49  152  147  Flat
5 2014-11-06 01:20:00  309   48    9   11   24   19  Flat
6 2014-11-06 01:31:00   24  177   64  152  188   19  Flat

Here is the function:

GenerateRF <- function(db, countstable, RFcutoff) {

  # load required libraries
  library(RODBC)
  library(randomForest)
  library(caret)
  library(ff)
  library(stringi)

  # connection and data preparation
  connection <- odbcConnect('TTODBC', uid = 'root', pwd = 'password', case = "nochange")

  # import count table and check if RF is allowed to be built
  query.str <- paste0('select * from ', db, '.', countstable, ' order by RowCount asc')
  row.counts <- sqlQuery(connection, query.str)

  # operate only on tables that have >= RFcutoff rows
  # (note: the threshold below is hard-coded to 20; RFcutoff itself is not used)
  for (i in 1:nrow(row.counts)) {
    table.name <- as.character(row.counts[i, 1])
    col.count <- as.numeric(row.counts[i, 2])
    row.count <- as.numeric(row.counts[i, 3])

    if (row.count >= 20) {

      # delete old RFs and DFs for input pattern
      if (file.exists(paste0(table.name, '_RF.Rdata'))) {
        file.remove(paste0(table.name, '_RF.Rdata'))
      }
      if (file.exists(paste0(table.name, '_DF.Rdata'))) {
        file.remove(paste0(table.name, '_DF.Rdata'))
      }

      # import and clean data
      query.str2 <- paste0('select * from ', db, '.', table.name, ' order by mqltime asc')
      raw.data <- sqlQuery(connection, query.str2)

      # partition data into training/test sets
      set.seed(489)
      index <- createDataPartition(raw.data$baXRC, p = 0.66, list = FALSE, times = 1)
      data.train <- raw.data[index, ]
      data.test <- raw.data[-index, ]

      # find optimal trees to grow (without outcome and dates)
      data.mtry <- as.data.frame(tuneRF(data.train[, c(-1, -col.count)], data.train$baXRC, ntreetry = 100,
                                        stepFactor = .5, improve = 0.01, trace = TRUE, plot = TRUE, dobest = FALSE))
      best.mtry <- data.mtry[which(data.mtry[, 2] == min(data.mtry[, 2])), 1]

      # compress df
      data.ff <- as.ffdf(data.train)

      # run RF. Originally set to 1000 trees but M1 dataset is too large for laptop. Maybe train at the lab?
      data.rf <- randomForest(baXRC ~ ., data = data.ff[, -1], mtry = best.mtry, ntree = 500, keep.forest = TRUE,
                              importance = TRUE, proximity = FALSE)

      # generate and print variable importance plot
      varImpPlot(data.rf, main = table.name)

      # predict on test data
      data.test.pred <- as.data.frame(predict(data.rf, data.test, type = "prob"))

      # get dates and name date column
      data.test.dates <- data.frame(data.test[, 1])
      colnames(data.test.dates) <- 'MQLTime'

      # attach dates to prediction df
      data.test.res <- cbind(data.test.dates, data.test.pred)

      # force date coercion to attempt negating unambiguous format error
      data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")

      # delete row names, coerce to dataframe, generate root table name and export outcomes to MySQL
      rownames(data.test.res) <- NULL
      data.test.res <- as.data.frame(data.test.res)
      root.table <- stri_sub(table.name, 0, -5)
      sqlUpdate(connection, data.test.res, tablename = paste0(db, '.', root.table, '_outcome'), index = "MQLTime")

      # save RF and test df/s for future use; save latest version of row_counts to MQL4 folder
      save(data.rf, file = paste0("C:/Users/user/Documents/RF_test2/", table.name, '_RF.Rdata'))
      save(data.test, file = paste0("C:/Users/user/Documents/RF_test2/", table.name, '_DF.Rdata'))
      write.table(row.counts, paste0("C:/Users/user/AppData/Roaming/MetaQuotes/Terminal/71FA4710ABEFC21F77A62A104A956F23/MQL4/Files/", db, "_m1_rowcounts.csv"),
                  sep = ",", col.names = F, row.names = F, quote = F)

    # end of conditional block
    }

  # end of for loop
  }

  # close all connections to MySQL
  odbcCloseAll()

  # clear workspace
  rm(list = ls())

# end of function
}

At this line:

data.test.res$MQLTime <- format(data.test.res$MQLTime, format = "%Y-%m-%d %H:%M:%S")

I have tried coercing MQLTime with a variety of functions, including: as.character(), as.POSIXct(), as.POSIXlt(), as.Date(), format(), as.character(as.Date())

and have also tried:

"%y" vs "%Y" and "%OS" vs "%S"

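None of these variants seems to have any effect on the error, and the function still runs on all the other tables. I have checked the table manually (it contains almost 1500 rows), and also in MySQL, looking for NULL dates or dates such as "0000-00-00 00:00:00".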
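
A manual scan of roughly 1500 rows can easily miss a single malformed value. As a minimal sketch (the helper name `find_bad_timestamps`, the sample vector and the UTC time zone are assumptions for illustration, not part of the original post), the check could be automated in R by parsing with an explicit format and listing whatever fails:

```r
# Hypothetical helper: report values that do not parse as "%Y-%m-%d %H:%M:%S".
find_bad_timestamps <- function(x, fmt = "%Y-%m-%d %H:%M:%S") {
  x <- as.character(x)
  # With an explicit format, strptime() returns NA for strings that do not
  # match, instead of raising the "unambiguous format" error
  parsed <- strptime(x, format = fmt, tz = "UTC")
  # Genuine NA inputs are not counted as malformed
  bad <- which(!is.na(x) & is.na(parsed))
  data.frame(row = bad, value = x[bad], stringsAsFactors = FALSE)
}

# Made-up sample data for illustration only
times <- c("2014-11-05 23:35:00", "0000-00-00 00:00:00",
           "2014-11-06 00:40:00", "00:00:00")
bad_rows <- find_bad_timestamps(times)
print(bad_rows)
```

Running such a check over every table, not only the one that appears to fail, would show where the malformed values actually live.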

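Also, if I run the function line by line in the R terminal, the offending table is processed without any problems, which only confuses me further.

I have exhausted every function/solution I can think of (and everything I could find via Dr. Google), so I am here asking for help. I should mention that the MQLTime column is stored as varchar() in MySQL. This was done to try to work around type-conversion issues between R and MySQL.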

SHOW VARIABLES LIKE "%version%";
innodb_version, 5.6.19
protocol_version, 10
slave_type_conversions, 
version, 5.6.19
version_comment, MySQL Community Server (GPL)
version_compile_machine, x86
version_compile_os, Win32


> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)

EDIT: str() output on the data imported from MySQL, showing that MQLTime is already in POSIXct format:

> str(raw.data)
'data.frame':   1472 obs. of  8 variables:
 $ MQLTime: POSIXct, format: "2014-11-05 23:35:00" "2014-11-05 23:57:00" "2014-11-06 00:40:00" "2014-11-06 00:46:00" ...
 $ bar5   : int  184 203 179 28 309 24 156 48 309 437 ...
 $ bar4   : int  24 184 309 192 48 177 48 68 60 71 ...
 $ bar3   : int  8 204 49 60 9 64 68 27 192 147 ...
 $ bar2   : int  24 67 189 49 11 152 27 56 437 67 ...
 $ bar1   : int  67 51 75 152 24 188 56 147 71 0 ...
 $ pat1   : int  147 147 19 147 19 19 147 19 147 19 ...
 $ baXRC  : Factor w/ 3 levels "Down","Flat",..: 2 2 2 2 2 2 2 2 2 3 ...

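So I tried declaring stringsAsFactors = FALSE in the data frame operations, which had no effect.

Interestingly, if the offending table is excluded from processing through an additional condition in the first 'if' block, the function stops on the table before the offending one.

If both the original and the new offending tables are excluded from processing, the function stops on the table before them. I have never seen this behaviour before and it really has me stumped.

I watched system resources while the function was running, but they never seem to max out.

Could this be an issue with the 'for' loop, and not necessarily with date formats?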
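
One way to find out which table actually triggers the error is to wrap each iteration in tryCatch() and log the name of the table being processed when the error is raised. The following is only a sketch: `process_table` is a hypothetical stand-in for the body of the for loop, and the three small tables are made-up data.

```r
# Hypothetical stand-in for the per-table work done inside the for loop.
process_table <- function(table.name, timestamps) {
  # With an explicit format, unparseable strings become NA, so turn any NA
  # into an error with a clear message
  parsed <- as.POSIXct(timestamps, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  if (any(is.na(parsed))) stop("unparseable MQLTime value")
  length(parsed)
}

# Made-up tables for illustration only
tables <- list(
  tbl_a = c("2014-11-05 23:35:00", "2014-11-05 23:57:00"),
  tbl_b = c("2014-11-06 00:40:00", "0000-00-00 00:00:00"),
  tbl_c = c("2014-11-06 01:20:00")
)

failed <- character(0)
for (table.name in names(tables)) {
  tryCatch(
    process_table(table.name, tables[[table.name]]),
    error = function(e) {
      # Report and record the failing table, then carry on with the next one
      message("Failed on table ", table.name, ": ", conditionMessage(e))
      failed <<- c(failed, table.name)
    }
  )
}
print(failed)
```

Because the loop continues after a failure, the log names every table that fails rather than only the point at which the run stops.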

Answer 1

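There appears to be some egg on my face. The table following the one the function stopped on had a row holding an invalid all-zero date-time value (of the "0000-00-00 00:00:00" kind). I have added another statement to the MySQL function that removes these rows when the tables are preprocessed. Thanks to those who had a look at this.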
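
The fix described above is applied on the MySQL side. A comparable clean-up could also be done in R right after import; the sketch below assumes the time column arrives as character (or can be converted to it), and the helper name `drop_invalid_times`, the sample data frame and the UTC time zone are hypothetical choices for illustration.

```r
# Hypothetical helper: keep only rows whose time column parses, and convert
# that column to POSIXct.
drop_invalid_times <- function(df, col = "MQLTime", fmt = "%Y-%m-%d %H:%M:%S") {
  parsed <- as.POSIXct(as.character(df[[col]]), format = fmt, tz = "UTC")
  keep <- !is.na(parsed)            # FALSE for values such as "0000-00-00 00:00:00"
  out <- df[keep, , drop = FALSE]
  out[[col]] <- parsed[keep]
  rownames(out) <- NULL
  out
}

# Made-up sample data for illustration only
raw <- data.frame(
  MQLTime = c("2014-11-05 23:35:00", "0000-00-00 00:00:00", "2014-11-06 00:40:00"),
  bar5 = c(184L, 0L, 179L),
  stringsAsFactors = FALSE
)
clean <- drop_invalid_times(raw)
str(clean)
```

Note that this drops rows silently; logging the number of removed rows per table would make the clean-up easier to audit.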
