Question

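I maintain an R package that needs to check the existence of many small files individually. Repeated calls to file.exists() produce noticeable slowness (benchmarking results here). Unfortunately, situational constraints prevent me from calling file.exists() once on the whole batch of files in vectorized fashion, which I believe would be faster. Is there a faster way to check for the existence of a single file? Maybe in C? This approach does not seem to be any faster on my system (the same one that produced these benchmarks):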

library(inline)
library(microbenchmark)

body <- "
  FILE *fp = fopen(CHAR(STRING_ELT(r_path, 0)), \"r\");
  SEXP result = PROTECT(allocVector(INTSXP, 1));
  INTEGER(result)[0] = fp == NULL? 0 : 1;
  UNPROTECT(1);
  return result;
"

file_exists_c <- cfunction(sig = signature(r_path = "character"), body = body)

tmp <- tempfile()

microbenchmark(
  c = file_exists_c(tmp),
  r = file.exists(tmp)
)
#> Unit: microseconds
#>  expr   min     lq    mean median     uq    max neval
#>     c 4.912 5.0230 5.42443 5.0605 5.1240 25.264   100
#>     r 3.972 4.0525 4.32615 4.1835 4.2675 11.750   100

file.create(tmp)
#> [1] TRUE

microbenchmark(
  c = file_exists_c(tmp),
  r = file.exists(tmp)
)
#> Unit: microseconds
#>  expr    min      lq     mean  median      uq    max neval
#>     c 16.212 16.6245 17.04727 16.7645 16.9860 32.207   100
#>     r  6.242  6.4175  7.16057  7.2830  7.4605 26.781   100

Created on 2019-12-06 by the reprex package (v0.3.0)

Edit:

access() does appear to be a bit faster, but not by much.

access()

Created on 2019-12-07 by the reprex package (v0.3.0)

Answer 1

Here is the full source code of file.exists (as of this writing):

https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/platform.c#L1375-L1404

SEXP attribute_hidden do_fileexists(SEXP call, SEXP op, SEXP args, SEXP rho)
{
    SEXP file, ans;
    int i, nfile;
    checkArity(op, args);
    if (!isString(file = CAR(args)))
        error(_("invalid '%s' argument"), "file");
    nfile = LENGTH(file);
    ans = PROTECT(allocVector(LGLSXP, nfile));
    for (i = 0; i < nfile; i++) {
        LOGICAL(ans)[i] = 0;
        if (STRING_ELT(file, i) != NA_STRING) {
#ifdef Win32
            /* Package XML sends arbitrarily long strings to file.exists! */
            size_t len = strlen(CHAR(STRING_ELT(file, i)));
            if (len > MAX_PATH)
                LOGICAL(ans)[i] = FALSE;
            else
                LOGICAL(ans)[i] =
                    R_WFileExists(filenameToWchar(STRING_ELT(file, i), TRUE));
#else
            // returns NULL if not translatable
            const char *p = translateCharFP2(STRING_ELT(file, i));
            LOGICAL(ans)[i] = p && R_FileExists(p);
#endif
        } else LOGICAL(ans)[i] = FALSE;
    }
    UNPROTECT(1); /* ans */
    return ans;
}

As for R_FileExists, it is here:

https://github.com/wch/r-source/blob/bfe73ecd848198cb9b68427cec7e70c40f96bd72/src/main/sysutils.c#L60-L79

#ifdef Win32
Rboolean R_FileExists(const char *path)
{
    struct _stati64 sb;
    return _stati64(R_ExpandFileName(path), &sb) == 0;
}
#else
Rboolean R_FileExists(const char *path)
{
    struct stat sb;
    return stat(R_ExpandFileName(path), &sb) == 0;
}
#endif

(R_ExpandFileName is just doing path.expand.) It relies on the stat system call:

https://en.wikipedia.org/wiki/Stat_(system_call)

https://pubs.opengroup.org/onlinepubs/007908799/xsh/sysstat.h.html

It is built for vectorized input, so, as mentioned above, file.exists(vector_of_files) is preferable to repeatedly running file.exists(single_file).

As far as I can tell (admittedly, I am no expert on system utilities), any gain in efficiency would come at the cost of robustness.
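
To make the mechanics concrete, here is a minimal standalone sketch (POSIX only, outside of R) of the same idea: one stat() call per path, inside a loop over a batch. The names file_exists_stat and check_many are hypothetical and not part of R's sources. Unlike R_FileExists, this sketch does no path expansion, so a leading ~ is not resolved.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical helper: 1 if stat() succeeds on path, 0 otherwise.
   A NULL path is reported as "does not exist", loosely mirroring
   how do_fileexists treats NA / untranslatable strings. */
static int file_exists_stat(const char *path)
{
    struct stat sb;
    return path != NULL && stat(path, &sb) == 0;
}

/* Hypothetical batch wrapper: fills out[i] for each of the n paths,
   in the spirit of the loop in do_fileexists. */
static void check_many(const char *const *paths, int n, int *out)
{
    assert(paths != NULL && out != NULL);
    for (int i = 0; i < n; i++)
        out[i] = file_exists_stat(paths[i]);
}
```

Calling check_many once on an array of paths keeps the per-call overhead outside the loop, which is the advantage the vectorized file.exists() has over repeated single-file calls.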

Answer 2

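A simple solution in C is to use access(filename, 0); if the function returns 0, the file exists. The second argument, 0, specifies that only existence is checked. Example: checking for the file test.txt in the /test directory: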

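#include <stdio.h>
#ifdef _WIN32
#include <io.h>      /* access() on Windows */
#else
#include <unistd.h>  /* access() on POSIX systems */
#endif

int main(void)
{
    /* Mode 0 (F_OK on POSIX) checks for existence only. */
    if (access("/test/test.txt", 0) == 0)
        printf("file exists\n");
    return 0;
}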
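
A slightly more defensive variant of the check above, as a sketch that assumes a POSIX system (unistd.h and F_OK). The helper name file_exists_access is hypothetical. Since access() can fail for reasons other than a missing file (for example EACCES when a parent directory is not searchable), the sketch passes errno back so the caller can tell the cases apart.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if path exists, 0 otherwise.
   If err is non-NULL it receives 0 on success, or the errno value
   set by access() on failure (ENOENT means "no such file"). */
static int file_exists_access(const char *path, int *err)
{
    assert(path != NULL);  /* precondition: caller passes a real string */

    if (access(path, F_OK) == 0) {
        if (err != NULL)
            *err = 0;
        return 1;
    }
    if (err != NULL)
        *err = errno;
    return 0;
}
```

A caller that only needs a yes/no answer can pass NULL for err. Note that access() evaluates permissions against the real user ID rather than the effective one, which may matter for setuid programs.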

Faster alternative to file.exists()
