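I want to read a text file line by line in R, using a for loop and the length of the file. The problem is that it only prints character(0). Here is the code: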
fileName="up_down.txt"
con=file(fileName,open="r")
line=readLines(con)
long=length(line)
for (i in 1:long){
linn=readLines(con,1)
print(linn)
}
close(con)
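Answer 0 (score: 96)
You should be careful with readLines(...) and large files. Reading all lines into memory at once can be risky. Below is an example of how to read a file and process it one line at a time: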
processFile = function(filepath) {
con = file(filepath, "r")
while ( TRUE ) {
line = readLines(con, n = 1)
if ( length(line) == 0 ) {
break
}
print(line)
}
close(con)
}
Be aware that reading even a single line into memory carries a risk. A large file with no line breaks can fill up your memory too.
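Reading one line per call can be slow on long files. A middle ground is to read a fixed number of lines per call, which keeps memory bounded while cutting down the number of reads. The following is a minimal sketch under that assumption; `process_in_chunks`, `handler`, and `chunk_size` are hypothetical names introduced here for illustration and are not part of the answer above.

```r
# Hypothetical helper: read a file in chunks of `chunk_size` lines and
# pass each chunk (a character vector) to `handler`.
process_in_chunks <- function(filepath, handler, chunk_size = 1000L) {
  con <- file(filepath, open = "r")
  on.exit(close(con))                 # close the connection even if handler fails
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break    # nothing left to read: end of file
    handler(lines)
    total <- total + length(lines)
  }
  invisible(total)                    # number of lines processed
}

# Usage on a small temporary file (illustrative data only)
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:5), tmp)
n_read <- process_in_chunks(tmp, function(x) print(x), chunk_size = 2L)
unlink(tmp)
```

The same caveat applies as for single lines: a chunk is only as small as the lines it contains, so a file without line breaks is still read in one piece.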
Answer 1 (score: 38)
Just use readLines on your file:
R> res <- readLines(system.file("DESCRIPTION", package="MASS"))
R> length(res)
[1] 27
R> res
[1] "Package: MASS"
[2] "Priority: recommended"
[3] "Version: 7.3-18"
[4] "Date: 2012-05-28"
[5] "Revision: $Rev: 3167 $"
[6] "Depends: R (>= 2.14.0), grDevices, graphics, stats, utils"
[7] "Suggests: lattice, nlme, nnet, survival"
[8] "Authors@R: c(person(\"Brian\", \"Ripley\", role = c(\"aut\", \"cre\", \"cph\"),"
[9] " email = \"ripley@stats.ox.ac.uk\"), person(\"Kurt\", \"Hornik\", role"
[10] " = \"trl\", comment = \"partial port ca 1998\"), person(\"Albrecht\","
[11] " \"Gebhardt\", role = \"trl\", comment = \"partial port ca 1998\"),"
[12] " person(\"David\", \"Firth\", role = \"ctb\"))"
[13] "Description: Functions and datasets to support Venables and Ripley,"
[14] " 'Modern Applied Statistics with S' (4th edition, 2002)."
[15] "Title: Support Functions and Datasets for Venables and Ripley's MASS"
[16] "License: GPL-2 | GPL-3"
[17] "URL: http://www.stats.ox.ac.uk/pub/MASS4/"
[18] "LazyData: yes"
[19] "Packaged: 2012-05-28 08:47:38 UTC; ripley"
[20] "Author: Brian Ripley [aut, cre, cph], Kurt Hornik [trl] (partial port"
[21] " ca 1998), Albrecht Gebhardt [trl] (partial port ca 1998), David"
[22] " Firth [ctb]"
[23] "Maintainer: Brian Ripley <ripley@stats.ox.ac.uk>"
[24] "Repository: CRAN"
[25] "Date/Publication: 2012-05-28 08:53:03"
[26] "Built: R 2.15.1; x86_64-pc-mingw32; 2012-06-22 14:16:09 UTC; windows"
[27] "Archs: i386, x64"
R>
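There is an entire manual devoted to this...
Answer 2 (score: 33)
Here is a solution with a for loop. Importantly, it moves the single call to readLines out of the for loop, so that it is not wrongly called again and again. Here it is: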
fileName <- "up_down.txt"
conn <- file(fileName,open="r")
linn <-readLines(conn)
for (i in seq_along(linn)){  # seq_along() also copes with an empty file, unlike 1:length(linn)
print(linn[i])
}
close(conn)
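To see why the code in the question prints character(0), it helps to look at the connection's read position. The sketch below uses a temporary file with made-up contents: the first readLines call consumes every line, so any later call on the same open connection has nothing left to return.

```r
# Illustrative data only: a three-line temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("up", "down", "up"), tmp)

con <- file(tmp, open = "r")
all_lines <- readLines(con)          # reads everything; the position is now at end of file
after_eof <- readLines(con, n = 1)   # nothing left, so this returns an empty vector
close(con)

print(length(all_lines))   # 3
print(after_eof)           # character(0)

# Looping over the vector that is already in memory needs no further reads
for (i in seq_along(all_lines)) {
  print(all_lines[i])
}
unlink(tmp)
```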
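Answer 3 (score: 4)
I wrote some code to read a file line by line to suit my needs, where different lines have different data types; see the articles read-line-by-line-of-a-file-in-r and determining-number-of-linesrecords. I think it should be a better solution for large files. My R version is 3.3.2.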
Answer 4 (score: 1)
I suggest you check out chunked and disk.frame. Both have functions for reading CSVs chunk by chunk.
In particular, disk.frame::csv_to_disk.frame may be the function you are after?
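For readers who want to see the chunk-by-chunk idea without installing anything, here is a rough base-R sketch. It does not use or imitate the API of chunked or disk.frame; the file contents, the chunk size of 2, and the running sum are assumptions made purely for illustration. It also assumes no CSV field contains an embedded line break.

```r
# Illustrative data only: a small CSV written to a temporary file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, value = c(10, 20, 30, 40, 50)),
          tmp, row.names = FALSE)

con <- file(tmp, open = "r")
# Read the header line once and split it into column names
header <- scan(text = readLines(con, n = 1), what = "", sep = ",", quiet = TRUE)

total <- 0
repeat {
  lines <- readLines(con, n = 2)      # next chunk of at most 2 rows
  if (length(lines) == 0) break       # end of file
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  total <- total + sum(chunk$value)   # any per-chunk processing goes here
}
close(con)
unlink(tmp)

print(total)
```

Only one chunk is held as a data frame at a time, which is the property the two packages provide in a more complete form.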
Answer 5 (score: 0)
fileName = "up_down.txt"
### code to get the line count of the file
length_connection = pipe(paste("cat ", fileName, " | wc -l", sep = "")) # "cat fileName | wc -l" because that returns just the line count, and NOT the name of the file with it
long = as.numeric(trimws(readLines(con = length_connection, n = 1)))
close(length_connection) # make sure to close the connection
###
for (i in 1:long){
### code to extract a single line at row i from the file
linn_connection_cmd = paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ") # extracts one line from fileName at the desired line number (i)
linn_connection = pipe(linn_connection_cmd)
linn = readLines(con = linn_connection, n = 1)
close(linn_connection) # make sure to close the connection
###
# the line is now loaded into R and anything can be done with it
print(linn)
}
By using R's pipe() command, and using shell commands to extract what we want, the full file is never loaded into R; it is read one line at a time.
paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ")
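It is this command that does all the work; it extracts one line from the desired file.
Edit: R's default behaviour is to render i as a plain number while it is below 100,000, but to switch to scientific notation once it is greater than or equal to 100,000 (1e+05). That is why format(x = i, scientific = FALSE, big.mark = "") is used in our pipe command: it makes sure pipe() always receives the number in plain form, which is the only form the shell command understands. If pipe() is given a number such as 1e+05, the command cannot parse it and fails with the following error: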
head: 1e+05: invalid number of lines
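The formatting issue can be checked in isolation, without running any shell command. The short sketch below only builds the command string; it assumes i is stored as a double, which is the case where the default conversion switches to scientific notation.

```r
i <- 1e5                       # a double, not an integer
default_text <- as.character(i)
plain_text <- format(x = i, scientific = FALSE, big.mark = "")

print(default_text)   # "1e+05"
print(plain_text)     # "100000"

# The command string that would be handed to pipe(), built as in the answer
fileName <- "up_down.txt"
cmd <- paste("head -n", plain_text, fileName, "| tail -n 1", sep = " ")
print(cmd)
```

An integer value such as 100000L converts to "100000" by default, so the explicit format() call matters mainly when the counter is a double.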