Question

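I have a CSV file of about 1 GB, and because my laptop has only a basic configuration, I cannot open the file in Excel or R. Out of curiosity, though, I would like to get the number of rows in the file. How can I do that, if it is possible at all?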

Answer 1

For Linux/Unix:

wc -l filename

For Windows:

find /c /v "A String that is extremely unlikely to occur" filename
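If neither tool is at hand, the idea behind wc -l (counting newline bytes) can be sketched in R without parsing the file or loading it whole. This is only an illustration: count_newlines and chunk_bytes are names made up here, and, like wc -l, it reports one less than the number of lines when the last line has no trailing newline.

```r
# Hypothetical helper: count newline bytes (0x0A) by reading the file
# in fixed-size binary chunks, so memory use stays small.
count_newlines <- function(path, chunk_bytes = 1024L * 1024L) {
  con <- file(path, open = "rb")
  on.exit(close(con))          # always release the connection
  newline <- as.raw(10L)       # the "\n" byte
  total <- 0                   # double, so very large counts do not overflow
  repeat {
    chunk <- readBin(con, what = "raw", n = chunk_bytes)
    if (length(chunk) == 0L) break   # end of file
    total <- total + sum(chunk == newline)
  }
  total
}

# Usage (the count includes the header line, if there is one):
# count_newlines("filename")
```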

Answer 2

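Option 1:

Through a file connection, count.fields() counts the number of fields on each line of the file, based on some sep value (which we do not care about here). So if we take the length of that result, we should in theory end up with the number of lines (and therefore rows) in the file.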

length(count.fields(filename))

If you have a header row, you can skip it with skip = 1:

length(count.fields(filename, skip = 1))

You can adjust the other arguments to suit your particular needs, for example whether blank lines are skipped.

args(count.fields)
# function (file, sep = "", quote = "\"'", skip = 0, blank.lines.skip = TRUE, 
#     comment.char = "#") 
# NULL

See help(count.fields) for details.
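Since the defaults printed by args() are geared towards whitespace-separated text (any line starting with # is treated as a comment, and apostrophes act as quote characters), a CSV file may be safer to count with explicit settings. The sketch below assumes a comma-separated file with double-quoted fields and, by default, a one-line header; count_csv_rows is a hypothetical helper name, not part of R.

```r
# Hypothetical helper: count the data rows of a comma-separated file.
count_csv_rows <- function(path, header = TRUE) {
  fields <- count.fields(
    path,
    sep = ",",                       # assumption: comma-separated
    quote = "\"",                    # only double quotes delimit fields
    skip = if (header) 1L else 0L,   # drop the header line if present
    blank.lines.skip = TRUE,
    comment.char = ""                # do not treat '#' as a comment marker
  )
  length(fields)
}

# Usage:
# count_csv_rows("Batting.csv")
```

As far as I understand the documentation, a quoted field that contains a line break still occupies several entries in the result, so in that case the function counts lines rather than logical records.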

As far as speed goes, it is not too bad. I tested it on one of my baseball files, which contains 99846 rows.

nrow(data.table::fread("Batting.csv"))
# [1] 99846

system.time({ l <- length(count.fields("Batting.csv", skip = 1)) })
#   user  system elapsed 
#  0.528   0.000   0.503 

l
# [1] 99846
file.info("Batting.csv")$size
# [1] 6153740

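(More efficient) Option 2: Another idea is to use data.table::fread() to read only the first column and then take the number of rows. This is very fast.

system.time(nrow(data.table::fread("Batting.csv", select = 1L)))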
#   user  system elapsed 
#  0.063   0.000   0.063

Answer 3

Here is something I have used:

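# Read the file in chunks so it never has to fit in memory at once
testcon <- file("xyzfile.csv", open = "r")
readsizeof <- 20000   # number of lines to read per chunk
nooflines <- 0
# readLines() returns a zero-length vector once the end of the file is reached
while ((linesread <- length(readLines(testcon, readsizeof))) > 0) {
  nooflines <- nooflines + linesread
}
close(testcon)
nooflines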

See this post for more information: https://www.r-bloggers.com/easy-way-of-determining-number-of-linesrecords-in-a-given-large-file-using-r/

Answer 4

Estimate the number of lines from the size of the first 1000 lines:

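# Bytes in the first 1000 lines; readLines() drops the line terminator,
# so add 1 byte per line (2 for files with CRLF line endings)
size1000  <- sum(nchar(readLines(con = "dgrp2.tgeno", n = 1000), type = "bytes") + 1)

sizetotal <- file.size("dgrp2.tgeno")
# Scale the sample up to the whole file
1000 *  sizetotal / size1000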

This is usually good enough for most purposes, and it is a lot faster for huge files.
