How do I combine two vectors into one data structure and then loop over it?

Time: 2015-04-24 15:28:37

Tags: r

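I'm trying to avoid duplicating code to loop over two sets of files ('yes' and 'no' model-training files), so I combined the two vectors of file names into a single data.frame, along with an extra bit of metadata that tracks whether each file is a 'yes' file or a 'no' file. The resulting data structure looks right, but I can't figure out how to loop over the data.frame.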

Perhaps the best solution is to combine the two vectors into a different kind of data structure (i.e. not a data.frame)?

> yesFiles = c("yFile1", "yFile2", "yFile3", "yFile4")
> noFiles = c("nFile1", "nFile2", "nFile3", "nFile4")
> allFiles = data.frame(result=c(rep("yes", times=length(yesFiles)), rep("no", times=length(noFiles))), name=c(yesFiles, noFiles))
> allFiles
  result   name
1    yes yFile1
2    yes yFile2
3    yes yFile3
4    yes yFile4
5     no nFile1
6     no nFile2
7     no nFile3
8     no nFile4
> 
> for (file in allFiles) { cat(sep="", file$result, ": ", file$name, "\n") }
Error in file$result : $ operator is invalid for atomic vectors
>
> for (file in allFiles) { cat(sep="", file['result'], ": ", file['name'], "\n") }
NA: NA
NA: NA
> 

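The loop appears to iterate over columns rather than rows. How do I make it iterate over rows? Or is there a better way to combine the data so that both sets can be processed in a single loop?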

I then tried looping over the same structure a different way, but it still doesn't work...

> yesFiles = c("yFile1", "yFile2", "yFile3", "yFile4")
> noFiles = c("nFile1", "nFile2", "nFile3", "nFile4")
> allFiles = data.frame(result=c(rep("yes", times=length(yesFiles)), rep("no", times=length(noFiles))), name=c(yesFiles, noFiles))
> allFiles
  result   name
1    yes yFile1
2    yes yFile2
3    yes yFile3
4    yes yFile4
5     no nFile1
6     no nFile2
7     no nFile3
8     no nFile4
> 
> allFiles[1,1]
[1] yes
Levels: no yes
> allFiles[1,2]
[1] yFile1
Levels: nFile1 nFile2 nFile3 nFile4 yFile1 yFile2 yFile3 yFile4
> # ...ah, great! These seem to be giving me what I need.
> 
> for (i in 1:nrow(allFiles)) {
+    result = allFiles[i,1]
+    file = allFiles[i,2]
+    cat(sep="", "File '", file, "' is a '", result, "' file.\n")
+ }
File '5' is a '2' file.
File '6' is a '2' file.
File '7' is a '2' file.
File '8' is a '2' file.
File '1' is a '1' file.
File '2' is a '1' file.
File '3' is a '1' file.
File '4' is a '1' file.
> # ...wha? What's up with the numbers? I thought [1,1], etc, gave strings!

What am I doing wrong?

Here is additional information about what I actually need to do inside the loop, as requested by 'Colonel Beauvel' in the comments below his answer.

First, I need a utility function to convert the text timestamp on each line of the .csv files:

#-----------------------------------------------
# Read a text timestamp of the form "yyyy-mm-ddThh:mm:ss.xxx",
# where xxx=milliseconds. Returns a numeric value of the seconds
# since Jan 1 1970, with millisecond precision (i.e. 3 decimal places).
#
readTimestamp = function (tstamp) {
  as.numeric(strptime(tstamp,format='%Y-%m-%dT%H:%M:%S.')) +
  as.numeric(substr(tstamp,20,23))
}

Now, the loop I want to run (the code hasn't been debugged yet, so I'm sure it has problems):

colnamesToKeep = union("Seconds", sensorNamesForThisModel)
dataset = list() # Eventually 'dataset' will hold all training data from all files
for (file in allFiles)
{
   cat(sep="", "Reading '", file['result'], "' file \"", file['name'], "\".\n")
   tmp = read.csv(file['name'], na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
   attr(tmp, "names")[1] = "Seconds"   # Rename column 1 to "Seconds" (it's not yet, but it will be)
   tmp = tmp[,-2:-4]      # Delete these columns; they're irrelevant to the KSVM model
   beginTime = readTimestamp(tmp[1,1])
   # Convert column 1 from text timestamps to numeric seconds (msec precision) starting at 0.000
   tmp[,1] = readTimestamp(tmp[,1]) - beginTime
   # Delete all columns for sensors that this model cares nothing about...
   colIndicesToDelete = -which(!(colnames(tmp) %in% colnamesToKeep))
   tmp = tmp[,colIndicesToDelete] # Delete all columns for sensors that this model cares nothing about
   dataset[[length(dataset)+1]] = list(result=file['result'], data=tmp) # Add this to the training dataset
}

I'm open to any and all suggestions, including ones such as "You shouldn't be using union() to create the colnamesToKeep variable." Thank you very much!

2 answers:

Answer 0 (score: 0)

I figured it out, as shown below. But I'm still very open to suggestions about a better way.

> yesFiles = c("yFile1", "yFile2", "yFile3", "yFile4")
> noFiles = c("nFile1", "nFile2", "nFile3", "nFile4")
> allFiles = data.frame(result=c(rep("yes", times=length(yesFiles)), rep("no", times=length(noFiles))), name=c(yesFiles, noFiles))
> allFiles
  result   name
1    yes yFile1
2    yes yFile2
3    yes yFile3
4    yes yFile4
5     no nFile1
6     no nFile2
7     no nFile3
8     no nFile4
> 
> for (i in 1:nrow(allFiles)) {
+    result = as.character(allFiles[[i,1]])
+    file = as.character(allFiles[[i,2]])
+    cat(sep="", "File '", file, "' is a '", result, "' file.\n")
+ }
File 'yFile1' is a 'yes' file.
File 'yFile2' is a 'yes' file.
File 'yFile3' is a 'yes' file.
File 'yFile4' is a 'yes' file.
File 'nFile1' is a 'no' file.
File 'nFile2' is a 'no' file.
File 'nFile3' is a 'no' file.
File 'nFile4' is a 'no' file.
> 
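
Two R behaviours appear to explain the failures in the question. A data frame is a list of columns, so `for (x in df)` visits one column per iteration rather than one row. A factor stores integer codes plus a set of levels, and `cat()` writes the codes, which is why `as.character()` is needed above. The following is a minimal sketch on a small stand-in data frame (not the original files); `factor()` is called explicitly because the `stringsAsFactors` default of `data.frame()` changed to `FALSE` in R 4.0.0. The names `columnLengths`, `codes` and `labels` are illustrative only.

```r
# Small stand-in for the question's data; factor() is used explicitly so the
# behaviour does not depend on the stringsAsFactors default.
allFiles <- data.frame(
  result = factor(c("yes", "yes", "no")),
  name   = factor(c("yFile1", "yFile2", "nFile1"))
)

# A data frame is a list of columns, so for() visits one column per iteration.
columnLengths <- c()
for (column in allFiles) {
  columnLengths <- c(columnLengths, length(column))
}
print(columnLengths)   # one entry per column, each equal to the row count

# A factor holds integer codes that index into its (sorted) levels.
codes  <- as.integer(allFiles$result)    # what cat() would write
labels <- as.character(allFiles$result)  # the text the codes stand for
print(codes)
print(labels)
```

Passing `stringsAsFactors=FALSE` when building the data frame avoids the conversion altogether, in which case `allFiles[i,1]` already yields a character string.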

Answer 1 (score: 0)

Try using an index i that runs over the number of files, and use it to subset the data frame as you loop. You can extract the value of a column with df$column[i]:

yesFiles = c("yFile1", "yFile2", "yFile3", "yFile4")
noFiles = c("nFile1", "nFile2", "nFile3", "nFile4")
files = data.frame(result=c(rep("yes", times=length(yesFiles)), rep("no", times=length(noFiles))), 
                      name=c(yesFiles, noFiles),
                      stringsAsFactors=FALSE)
files

for (i in seq_len(nrow(files))) { 
  cat(sep="", files$result[i], ": ", files$name[i], "\n") 
  # Do other stuff here; the file path is available via files$name[i]
}

>yes: yFile1
>yes: yFile2
>yes: yFile3
>yes: yFile4
>no: nFile1
>no: nFile2
>no: nFile3
>no: nFile4
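
As a possible alternative to an explicit index, base R's `Map()` walks several vectors in parallel and calls a function once per position, which fits the "one row at a time" intent of the question. This is only a sketch under assumptions: the data is shortened to two files per group, and `describeFile` and `fileLines` are hypothetical names standing in for whatever per-file work is actually needed.

```r
yesFiles <- c("yFile1", "yFile2")
noFiles  <- c("nFile1", "nFile2")

# Keep the columns as plain character vectors.
files <- data.frame(
  result = rep(c("yes", "no"), times = c(length(yesFiles), length(noFiles))),
  name   = c(yesFiles, noFiles),
  stringsAsFactors = FALSE
)

# Hypothetical per-file function; real code would read and clean the file here.
describeFile <- function(result, name) {
  paste0(result, ": ", name)
}

# Map() pairs files$result[i] with files$name[i] and returns a list of results.
fileLines <- unlist(Map(describeFile, files$result, files$name), use.names = FALSE)
cat(fileLines, sep = "\n")
```

Because `Map()` returns a list, the same pattern can collect one entry per file (for example a `list(result=..., data=...)`) without growing a list by hand inside a loop.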