Question

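I have a simple binary structure with some repeating data types that I need to read efficiently in R. For example, an integer icount, followed by a structure {a integer, b real} repeated icount times. Consider this simple file written by Python: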

# Python -- this is not my question, it just makes data for my question
from struct import pack
with open('foo.bin', 'wb') as fp:
    icount = 123456
    fp.write(pack('i', icount))
    for i in range(icount):
        fp.write(pack('if', i, i * 100.0))

(If you would rather not generate it, you can download this <1 MB file.)
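For readers without Python at hand, the same layout can be sketched in R. This is only an illustration: `write_foo` is a hypothetical helper name, and it assumes the native byte order and the packed 4-byte int plus 4-byte float record that the Python script above produces.

```r
# Hypothetical helper (not from the original question): writes a 4-byte count
# followed by icount records of {4-byte integer, 4-byte float}.
write_foo <- function(path, icount) {
  fp <- file(path, "wb")
  on.exit(close(fp))
  writeBin(as.integer(icount), fp, size = 4)
  for (i in seq_len(icount) - 1L) {
    writeBin(as.integer(i), fp, size = 4)  # field a
    writeBin(i * 100.0, fp, size = 4)      # field b, stored as a 4-byte float
  }
  invisible(path)
}
```

Calling `write_foo("foo.bin", 123456)` should give a file of 4 + 8 * 123456 bytes, matching the one described above.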

To read this file into R, I can use readBin in a for loop, but it is painfully slow (as expected):

# R
fp <- file("foo.bin", "rb")
icount <- readBin(fp, "integer", size=4)
df <- data.frame(a=integer(icount), b=numeric(icount))
for (i in seq(icount)) {
    df$a[i] <- readBin(fp, "integer", size=4)
    df$b[i] <- readBin(fp, "numeric", size=4)
}
close(fp)

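I would like to know a more efficient way to read a non-uniform binary structure into a data.frame (or similar structure). I know that for loops should be avoided whenever possible.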

Answer 1

For reference, here is the timing of your loop (tested only once):

   user  system elapsed 
 174.04    1.55  180.96

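I sped up the reading with the following:

fp <- file("foo.bin", "rb")
icount <- readBin(fp, "integer", size=4)
# First pass: read every remaining 4-byte word as an integer (a, b, a, b, ...)
x <- replicate(icount*2, readBin(fp, "integer", size=4))
x <- x[0:(icount-1)*2+1]   # odd positions hold the a values
close(fp)
# Second pass: reopen and read every word, including the leading count, as a 4-byte float
fp <- file("foo.bin", "rb")
y <- replicate(icount*2+1, readBin(fp, "numeric", size=4))
y <- y[1:icount*2+1]       # positions 3, 5, 7, ... hold the b values
close(fp)
df <- data.frame(a=x, b=y)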

This was faster than I expected:

user  system elapsed 
3.08    0.10    3.18
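A possible variation on the same two-pass idea is to let readBin fetch all the words in one call through its `n` argument instead of calling it once per word with replicate. The sketch below is not the answerer's code and has not been timed here; `read_words` and `read_two_pass` are hypothetical names, and the file layout is assumed to be the one from the question.

```r
# Hypothetical helper: skip the leading count, then read n_words 4-byte words
# interpreted as `what` ("integer" or "numeric").
read_words <- function(path, what, n_words) {
  fp <- file(path, "rb")
  on.exit(close(fp))
  readBin(fp, "integer", size = 4)  # discard the count
  readBin(fp, what, size = 4, n = n_words)
}

read_two_pass <- function(path) {
  fp <- file(path, "rb")
  icount <- readBin(fp, "integer", size = 4)
  close(fp)
  as_int <- read_words(path, "integer", 2 * icount)  # a values at odd positions
  as_num <- read_words(path, "numeric", 2 * icount)  # b values at even positions
  odd <- seq(1, by = 2, length.out = icount)
  data.frame(a = as_int[odd], b = as_num[odd + 1])
}
```

Usage would be `df <- read_two_pass("foo.bin")`. Half of each pass is still decoded with the wrong type and then thrown away, so this keeps the spirit of the approach above rather than removing its redundancy.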

Answer 2

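I found a workaround that runs fast: read the whole block of structure data as "raw", then slice out the relevant bytes to interpret each member. Let me demonstrate:

fp <- file("foo.bin", "rb")
icount <- readBin(fp, "integer", size=4)
rec_size <- 4 + 4  # int is 4 bytes + float is 4 bytes
raw <- readBin(fp, "raw", n=icount * rec_size)
close(fp)
# Interpret raw bytes using specifically tailored slices for the structure
# (record numbers run from 0 to icount - 1)
raw_sel_a <- rep(seq_len(icount) - 1, each=4) * rec_size + 1:4
raw_sel_b <- rep(seq_len(icount) - 1, each=4) * rec_size + 1:4 + 4
df <- data.frame(
    a = readBin(raw[raw_sel_a], "integer", size=4, n=icount),
    b = readBin(raw[raw_sel_b], "numeric", size=4, n=icount))

The tricky part is building the index vectors (raw_sel_a and raw_sel_b) that slice the relevant parts of the raw structure for reading. This example is simple because every data member is 4 bytes. However, I can imagine this being harder for more complex data structures.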
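For records whose members differ in size, the index arithmetic can be wrapped in a small helper. The following is a minimal sketch under stated assumptions, not part of the original answer: `read_field` and `read_foo` are hypothetical names, records are assumed to be packed with no alignment padding, `offset` is the zero-based byte offset of the member within a record, and bytes are in native order.

```r
# Hypothetical helper: extract one fixed-size member from a raw vector of
# packed records. offset is the zero-based byte offset inside each record.
read_field <- function(bytes, n_rec, rec_size, offset, what, size) {
  # Zero-based start of the member in each record
  starts <- (seq_len(n_rec) - 1L) * rec_size + offset
  # Expand each start into the 1-based positions of its `size` bytes
  idx <- rep(starts, each = size) + seq_len(size)
  readBin(bytes[idx], what, size = size, n = n_rec)
}

# The file from the question, read with the helper
read_foo <- function(path) {
  fp <- file(path, "rb")
  on.exit(close(fp))
  icount <- readBin(fp, "integer", size = 4)
  rec_size <- 4 + 4
  bytes <- readBin(fp, "raw", n = icount * rec_size)
  data.frame(
    a = read_field(bytes, icount, rec_size, 0, "integer", 4),
    b = read_field(bytes, icount, rec_size, 4, "numeric", 4))
}
```

With this shape, a record such as {4-byte integer, 8-byte double, 2-byte integer} would need only three `read_field` calls with offsets 0, 4 and 12 and a record size of 14. Files written with padding between members or with a different byte order would need the offsets or readBin's `endian` argument adjusted.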

Reading binary structures in R
