Question

Is there a way to get the number of lines in a file without importing it?

So far this is what I am doing:

myfiles <- list.files(pattern="*.dat")
myfilesContent <- lapply(myfiles, read.delim, header=F, quote="\"")
test <- list()
for (i in seq_along(myfiles)){
  test[[i]] <- length(myfilesContent[[i]]$V1)
}

But this is too time-consuming, since each file is quite large.

Answer 1

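You can count the number of newline characters in the file (\n, which also works for \r\n on Windows). This gives the correct answer iff:

there is a newline character at the end of the last line (BTW, read.csv gives a warning if there is not)
the table does not contain a newline character in the data (e.g. within quotes)

I think it suffices to read the file in parts. Below I set the chunk (tmp buf) size to 65536 bytes: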

f <- file("filename.csv", open="rb")
nlines <- 0L
while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
   nlines <- nlines + sum(chunk == as.raw(10L))
}
print(nlines)
close(f)
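
The loop above can be wrapped in a reusable function. The following is only a minimal sketch, not part of the original answer: the name count_lines and the chunk_size argument are made up for illustration. It also counts a final line that lacks a trailing newline, which addresses the first caveat listed above; newlines embedded in quoted fields are still counted as line breaks.

```r
# Hypothetical helper (name and arguments are assumptions, not from the answer):
# counts lines by counting "\n" bytes while reading the file in binary chunks.
count_lines <- function(path, chunk_size = 65536L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  newline <- as.raw(10L)
  nlines <- 0            # double, so very large counts do not overflow
  last_byte <- newline   # makes an empty file count as zero lines
  while (length(chunk <- readBin(con, "raw", chunk_size)) > 0) {
    nlines <- nlines + sum(chunk == newline)
    last_byte <- chunk[length(chunk)]
  }
  # A non-empty file whose last byte is not "\n" ends in an unterminated line.
  if (last_byte != newline) nlines <- nlines + 1
  nlines
}

# Usage with the file list from the question:
# line_counts <- vapply(myfiles, count_lines, numeric(1))
```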

Benchmarks on a ca. 512 MB ASCII text file, 12101000 text lines, Linux:

readBin: ca. 2.4 s.
@luis_js's wc-based solution: 0.1 s.
read.delim: 39.6 s.
EDIT: Reading the file line by line with readLines (f <- file("/tmp/test.txt", open="r"); nlines <- 0L; while (length(l <- readLines(f, 128)) > 0) nlines <- nlines + length(l); close(f)): 32.0 s.
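
Timings of this kind can be reproduced with system.time. The following is a rough sketch on a small throwaway file; the file contents and size are made up here, so the absolute numbers will not match the 512 MB benchmark above.

```r
# Build a small temporary test file (far smaller than the 512 MB file above).
path <- tempfile(fileext = ".txt")
writeLines(rep("some text line", 1e5), path)

# Time the chunked readBin approach.
t_readbin <- system.time({
  f <- file(path, open = "rb")
  n_bin <- 0L
  while (length(chunk <- readBin(f, "raw", 65536)) > 0) {
    n_bin <- n_bin + sum(chunk == as.raw(10L))
  }
  close(f)
})

# Time a full parse with read.delim for comparison.
t_delim <- system.time(
  n_delim <- nrow(read.delim(path, header = FALSE, quote = "\""))
)

# Elapsed seconds for each approach.
print(c(readBin = t_readbin[["elapsed"]], read.delim = t_delim[["elapsed"]]))
unlink(path)
```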

Answer 2

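If you:

still want to avoid the overhead that a system2("wc"… will cause
are on BSD/Linux or OS X (I did not test the following on Windows)
do not mind using a full filename path
are comfortable using the inline package

then the following should be about as fast as you can get (it is pretty much the 'line count' portion of wc in an inline R C function):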

library(inline)

wc.code <- "
uintmax_t linect = 0; 
uintmax_t tlinect = 0;

int fd, len;
u_char *p;

struct statfs fsb;

static off_t buf_size = SMALL_BUF_SIZE;
static u_char small_buf[SMALL_BUF_SIZE];
static u_char *buf = small_buf;

PROTECT(f = AS_CHARACTER(f));

if ((fd = open(CHAR(STRING_ELT(f, 0)), O_RDONLY, 0)) >= 0) {

  if (fstatfs(fd, &fsb)) {
    fsb.f_iosize = SMALL_BUF_SIZE;
  }

  if (fsb.f_iosize != buf_size) {
    if (buf != small_buf) {
      free(buf);
    }
    if (fsb.f_iosize == SMALL_BUF_SIZE || !(buf = malloc(fsb.f_iosize))) {
      buf = small_buf;
      buf_size = SMALL_BUF_SIZE;
    } else {
      buf_size = fsb.f_iosize;
    }
  }

  while ((len = read(fd, buf, buf_size))) {

    if (len == -1) {
      /* read error: stop counting; fd is closed once after the loop */
      break;
    }

    for (p = buf; len--; ++p)
      if (*p == '\\n')
        ++linect;
  }

  tlinect += linect;

  (void)close(fd);

}
SEXP result;
PROTECT(result = NEW_INTEGER(1));
INTEGER(result)[0] = tlinect;
UNPROTECT(2);
return(result);
";

setCMethod("wc",
           signature(f="character"), 
           wc.code,
           includes=c("#include <stdlib.h>", 
                      "#include <stdio.h>",
                      "#include <sys/param.h>",
                      "#include <sys/mount.h>",
                      "#include <sys/stat.h>",
                      "#include <ctype.h>",
                      "#include <err.h>",
                      "#include <errno.h>",
                      "#include <fcntl.h>",
                      "#include <locale.h>",
                      "#include <stdint.h>",
                      "#include <string.h>",
                      "#include <unistd.h>",
                      "#include <wchar.h>",
                      "#include <wctype.h>",
                      "#define SMALL_BUF_SIZE (1024 * 8)"),
           language="C",
           convention=".Call")

wc("FULLPATHTOFILE")

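It would be better as a package, since it actually has to compile the first time through. But it is here for reference if you really do need "speed". For a 189,955-line file I had lying around, I get (mean value from a bunch of runs):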

   user  system elapsed 
  0.007   0.003   0.010

Answer 3

If you are using Linux, this might work for you:

# total lines on a file through system call to wc, and filtering with awk
target_file   <- "your_file_name_here"
total_records <- as.integer(system2("wc",
                                    args = c("-l",
                                             target_file,
                                             " | awk '{print $1}'"),
                                    stdout = TRUE))

In your case:

#
lapply(myfiles, function(x){
                         as.integer(system2("wc",
                                            args = c("-l",
                                                     x,
                                                     " | awk '{print $1}'"),
                                            stdout = TRUE))
                      }
                  )
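
As an alternative sketch (not from the original answer), the awk step can be dropped by parsing the first field of the wc output in R. The helper name wc_lines is made up for illustration, and this assumes a Unix-like system with wc on the PATH. The path is passed through shQuote so that file names containing spaces still work.

```r
# Hypothetical helper: runs `wc -l <file>` and extracts the leading count in R.
wc_lines <- function(path) {
  out <- system2("wc", args = c("-l", shQuote(path)), stdout = TRUE)
  # wc prints "<count> <file name>", possibly with leading spaces.
  first_field <- strsplit(trimws(out[1]), "\\s+")[[1]][1]
  as.integer(first_field)
}

# Usage with the file list from the question:
# line_counts <- vapply(myfiles, wc_lines, integer(1))
```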

Answer 4

Maybe I am missing something, but usually I do it by calling length on the result of readLines:

# "some_file.format" is a placeholder file name
con <- file("some_file.format")
nlines <- length(readLines(con))
close(con)

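At least this has worked in many cases I have had. I think it is reasonably fast, and it only creates a connection to the file without importing it.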

Answer 5

I found this easy way using the R.utils package:

library(R.utils)
sapply(myfiles,countLines)


Answer 6

Here is another way, using the CRAN package fpeek and its function peek_count_lines. This function is written in C++ and is very fast.

library(fpeek)
sapply(filenames, peek_count_lines)

Get the number of lines in a text file using R
