Question

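So, my goal is to write a function that takes any csv file, an output path, and any number of split sizes (in rows) as input, then randomizes the data and splits it into the appropriate files. I could easily do this by hand if I knew the split sizes ahead of time, but I want an automated function that can handle varying split sizes. It seems simple enough, and this is what I wrote: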

randomizer = function(startFile, endPath, ...){ ##where ... are the user-defined split sizes

           vec = unlist(list(...))

           n_files = length(vec)

           values = read.csv(startFile, stringsAsFactors = FALSE)

           values_rand = as.data.frame(values[sample(nrow(values)),])

           for(i in 1:n_files){
              if(nrow(values_rand)!=0 & !is.null(nrow(values_rand))){
                 assign(paste('group', i , sep=''), values_rand[1:vec[i], ]);
                 values_rand = as.data.frame(values_rand[(vec[i]+1):nrow(values_rand), ], stringsAsFactors = FALSE)
                 ## (A) write.csv fn here?
              } else {
                 print("something went wrong")
              }
           }
           ## (B) write.csv fn here?
}

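When I try something at (A) like write.csv(x= paste('group', i, sep=''), file= paste(endPath, '/group', i, '.csv', sep=''), row.names=FALSE) I either get an error or the literal string "group1" is written to the csv, instead of the chunk of the randomized data frame I'm looking for. I'm very confused, because this feels like I'm fighting R semantics rather than facing a real programming problem. Thanks in advance for your help.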

Answer 1

You have indeed coded yourself into a corner, and it is a common one for beginners, particularly those coming to R from other programming languages.

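Using assign is a big red flag. At least while you are starting out with the language, if you find yourself reaching for that function, stop and think again. You are most likely approaching the problem in completely the wrong way and need to rethink it.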
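As a rough illustration of the alternative, the chunks can be kept in a list rather than in variables created with assign. This is only a sketch: the toy data frame, the sizes vector, and the output directory (a temporary one) are assumptions made for the example and are not part of the question.

```r
# Hypothetical toy data standing in for the shuffled data frame
values_rand <- data.frame(id = 1:6, x = letters[1:6], stringsAsFactors = FALSE)
sizes <- c(2, 4)  # assumed split sizes, in rows

# paste() only builds a name: the result is a character string, not a data frame,
# which is why write.csv(x = paste(...)) writes the text "group1"
name <- paste0("group", 1)
is.character(name)

# Keep each chunk as an element of a list instead of calling assign()
groups <- vector("list", length(sizes))
start <- 1
for (i in seq_along(sizes)) {
  end <- start + sizes[i] - 1
  groups[[i]] <- values_rand[start:end, , drop = FALSE]
  start <- end + 1
}
names(groups) <- paste0("group", seq_along(sizes))

# groups[[i]] is the data frame itself, so it can go straight to write.csv()
out_dir <- tempdir()  # assumed output location for the example
for (i in seq_along(groups)) {
  out_file <- file.path(out_dir, paste0(names(groups)[i], ".csv"))
  write.csv(groups[[i]], file = out_file, row.names = FALSE)
}
```

With a list there is no need to look objects up by name later, so neither assign nor get is required.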

Here is a (completely untested) version of what you describe, with annotations:

split_file <- function(startFile,endPath,sizes){
    #There's no need to use "..." for the partition sizes.
    # A simple vector of values is much simpler

    values <- read.csv(startFile,stringsAsFactors = FALSE)

    if (sum(sizes) != nrow(values)){
        #I'm assuming here that we're not doing anything fancy with bad input
        stop("sizes do not evenly partition data!")
    }else{
        #Shuffle data frame
        # Note we don't need as.data.frame()
        values <- values[sample(nrow(values)),]

        #Split data frame
        values <- split(values,rep(seq_along(sizes),times = sizes))
        #Create the output file paths
        paths <- paste0(endPath,"/group_",seq_along(sizes),".csv")
        #We could shoe-horn this into lapply, but there's no real need
        for (i in seq_along(values)){
            write.csv(x = values[[i]],file = paths[i],row.names = FALSE)
        }
    }
}
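To make the splitting step concrete, here is a small sketch of how a grouping vector for split can be built from the sizes. The data frame and the sizes are made up for the example, and set.seed is there only so the shuffle is repeatable.

```r
sizes <- c(2, 3, 1)  # assumed split sizes; they sum to the number of rows

# One group label per row: label i is repeated sizes[i] times
grp <- rep(seq_along(sizes), times = sizes)
grp  # 1 1 2 2 2 3

set.seed(1)  # only to make the example reproducible
df <- data.frame(id = 1:6, value = (1:6) * 10)  # hypothetical data

# Shuffle the rows, then cut the shuffled frame into chunks by label
shuffled <- df[sample(nrow(df)), ]
chunks <- split(shuffled, grp)

# Each element of chunks is a data frame with the requested number of rows
sapply(chunks, nrow)
```

The grouping vector must have exactly one entry per row, which is why the sizes need to sum to nrow(df).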

Function to randomize (row-wise) a df, split, and then write to csv
