Question

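I have a file in which each line is a set of results collected in a particular replicate of an experiment. The number of results in each experiment (i.e. the number of columns in each row) may differ. The order of results within a row carries no meaning either (the first result in row 1 and the first result in row 2 are no more related than any other pair; these are sets of results).

The file looks like this: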

2141 0 5328 5180 357 5335 1 5453 5325 5226 7 4880 5486 0 
2650 0 5280 4980 5243 5301 4244 5106 5228 5068 5448 3915 4971 5585 4818 4388 5497 4914 5364 4849 4820 4370
2069 2595 2478 4941 
2627 3319 5192 5106 32 4666 3999 5503 5085 4855 4135 4383 4770 
2005 2117 2803 2722 2281 2248 2580 2697 2897 4417 4094 4722 5138 5004 4551 5758 5468 17361 
1914 1977 2414 100 2711 2171 3041 5561 4870 4281 4691 4461 5298 3849 5166 5578 5520 4634 4836 4905 5105 5089
2539 2326 0 4617 3735 0 5122 5439 5238 1
25 5316 21173 4492 5038 5944 5576 5424 5139 5184 5 5096 4963 2771 2808 2592 2
4963 9428 17152 5467 5202 6038 5094 5221 5469 5079 3753 5080 5141 4097 5173 11338 4693 5273 5283 5110 4503 51
2024 2 2822 5097 5239 5296 4561

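Except that each line is much longer (up to a few thousand values). As you can see, all values are non-negative integers.

In short, this is not an ordinary table in which the columns have meaning. It is just a bunch of results, with each set laid out on one line.

I would like to read all the results and then perform some operation on each experiment (row), such as computing the ecdf. I would also like to compute the average ecdf over all replicates.

My question: how do I read this odd-looking file? I am so used to read.table that I am not sure I have ever tried anything else... Do I have to use some low-level readLines? I guess the preferred output is a list of vectors (or a vector of vectors?). I looked at scan, but it seems that all the vectors have to be the same length there.

Any suggestions would be appreciated.

Update: Following the suggestions below, I now do the following: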

con <- file('myfile') 
open(con);
results.list <- list();
current.line <- 1
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
 results.list[[current.line]] <- as.integer(unlist(strsplit(line, split=" ")))
 current.line <- current.line + 1
} 
close(con)

It seems to work. Does it look OK?

When I call summary(results.list), I get:

      Length Class  Mode  
 [1,] 1091   -none- numeric
 [2,] 1070   -none- numeric
   ....

Shouldn't the class be integer? And what is the mode?

Answer 1

The example Josh linked to is the one I use all the time.

inputFile <- "/home/jal/myFile.txt"
con <- file(inputFile, open = "r")

dataList <- list()
ecdfList <- list()

# Read one line at a time until the connection is exhausted.
while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
    # Split the line on spaces and convert the pieces to a numeric vector.
    myVector <- list(as.numeric(strsplit(oneLine, " ")[[1]]))
    dataList <- c(dataList, myVector)

    # Wrap the ecdf in list() so it is appended as a single element.
    myEcdf <- ecdf(myVector[[1]])
    ecdfList <- c(ecdfList, list(myEcdf))
}

close(con)

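I edited the example to build two lists from your sample data. dataList is a list in which each item is the numeric vector for one line of the text file. ecdfList is a list in which each element is the ecdf for one line of the text file.

You should add some try() or tryCatch() logic in there to properly handle cases where the ecdf cannot be created because of nulls or some such. But the example above should get you pretty close. Good luck!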
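As a rough idea of what that error handling could look like, here is a minimal sketch. The helper name safe_ecdf and the sample lines are made up for illustration; it relies on ecdf() raising an error when it is given no non-missing values, and it simply stores NULL for such a row:

```r
# safe_ecdf is a hypothetical helper name, not part of base R.
# It returns the ecdf of x, or NULL if ecdf() signals an error.
safe_ecdf <- function(x) {
  tryCatch(
    ecdf(x),
    error = function(e) {
      message("could not build ecdf: ", conditionMessage(e))
      NULL
    }
  )
}

# Made-up input: the second line is empty, so it yields numeric(0).
lines <- c("2069 2595 2478 4941", "", "2024 2 2822")
dataList <- lapply(strsplit(lines, " "), as.numeric)

# lapply keeps the NULL entries, so positions still match the input lines.
ecdfList <- lapply(dataList, safe_ecdf)
```

Rows that failed can then be found with sapply(ecdfList, is.null).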

Answer 2

Yes, you can use readLines. JD Long has a good example, which I have lightly edited and provided below.

con  <- file(inputFile, open = "r")

while (length(oneLine <- readLines(con, n = 1, warn = FALSE)) > 0) {
  # do stuff
} 

close(con)

Answer 3

Why read line by line?

results.list <- lapply(strsplit(readLines("myfile")," "), as.integer)

gives a list of integer vectors.
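Building on that one-liner, a rough sketch of the per-row ecdf and the average ecdf mentioned in the question might look like this. The sample lines stand in for what readLines("myfile") would return, and the names ecdf.list, grid and mean.ecdf, the whitespace trimming, and the choice to average on the pooled set of observed values are all my own assumptions:

```r
# Stand-in for readLines("myfile"); note the trailing space on line 1.
sample_lines <- c("2069 2595 2478 4941 ",
                  "2024 2 2822 5097 5239 5296 4561")

# Trim the ends and split on runs of spaces, so stray blanks do not
# turn into NA values.
results.list <- lapply(strsplit(trimws(sample_lines), " +"), as.integer)

# One ecdf (a step function) per row.
ecdf.list <- lapply(results.list, ecdf)

# Evaluate every ecdf on a common grid and average pointwise.
grid <- sort(unique(unlist(results.list)))
values <- vapply(ecdf.list, function(f) f(grid), numeric(length(grid)))
mean.ecdf <- rowMeans(values)
```

plot(grid, mean.ecdf, type = "s") would then draw the averaged curve.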

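As for your other question: take a look at ?mode (in short: mode is "numeric" for numbers, typeof can be "integer" or "double", and class is "numeric" or "integer"). To check whether you have integers, look at str(results.list) or lapply(results.list, class).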
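A small illustration of the difference, using a made-up vector x:

```r
x <- as.integer(c(2141, 0, 5328))
mode(x)    # "numeric"
typeof(x)  # "integer"
class(x)   # "integer"

# The same values stored as doubles.
y <- as.numeric(x)
mode(y)    # "numeric"
typeof(y)  # "double"
class(y)   # "numeric"
```

So seeing "numeric" in the Mode column of summary() does not by itself mean the values were stored as doubles.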

Answer 4

Alternatively:

df <- read.delim(file = "whatever", header = FALSE, sep = " ")
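One caveat worth checking with this approach: as far as I know, read.table and its wrappers work out the number of columns from the first five lines of input unless col.names is supplied, so longer rows further down may not fit. A possible sketch that sizes the table with count.fields first and then strips the NA padding; the temporary file, its contents, and the names n and rows are made up for illustration:

```r
# Write a small ragged file to a temporary location.
tmp <- tempfile()
writeLines(c("1 2 3", "4 5", "6 7 8 9"), tmp)

# Widest row decides how many columns the data frame needs.
n <- max(count.fields(tmp, sep = " "))

# read.delim fills short rows with NA (fill = TRUE is its default).
df <- read.delim(tmp, header = FALSE, sep = " ",
                 col.names = paste0("V", seq_len(n)))

# Turn each row back into a plain vector without the NA padding.
rows <- lapply(seq_len(nrow(df)), function(i) {
  v <- unlist(df[i, ], use.names = FALSE)
  v[!is.na(v)]
})

unlink(tmp)
```

Since the columns carry no meaning here, the list in rows is probably the more natural structure to keep.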

Answer 5

Use

line <- readLines(con, 1)

to read one line from the connection con, which can be set up as simply as con <- file(filename, "r").

Answer 6

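If you know that the values in the file are integers, you can use scan() instead of readLines(), still inside a loop:

# 'myfile' is the file name used in the question.
con <- file('myfile')
open(con)
results.list <- list()
current.line <- 1
# scan() reads and parses one line per call; a zero-length result ends the loop.
while (length(line <- scan(con, what = integer(0), nlines = 1, quiet = TRUE)) > 0) {
  results.list[[current.line]] <- line
  current.line <- current.line + 1
}
close(con)

You will get a list of numeric vectors (stored as integers, given what = integer(0)).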
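If the explicit loop is not needed, the same parsing can probably be done by handing each line to scan() through its text argument. This is only a sketch: the temporary file and its contents are invented so that the snippet runs on its own.

```r
# Invented sample file with a trailing space and a doubled space.
tmp <- tempfile()
writeLines(c("2069 2595 2478 4941 ", "2024  2 2822"), tmp)

# scan() splits on any run of whitespace, so stray blanks are harmless.
results.list <- lapply(readLines(tmp), function(l) {
  scan(text = l, what = integer(), quiet = TRUE)
})

unlink(tmp)
```

Each element of results.list is then an integer vector for one line of the file.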

What is a good way to read line by line in R?
