Question

I have a zipped binary file on Windows that I am trying to read with R. So far this works using the unz() function in combination with the readBin() function.

> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> readBin(bin.con,
          "double", 
          n = byte_chunk, 
          size = 8L, 
          endian = "little")
> close(bin.con)

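where zip_path is the path to the zip file, file_in_zip is the name of the file inside the zip archive that is to be read, and byte_chunk is the number of bytes I want to read.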
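One detail worth checking here: as far as I can tell from the readBin documentation, n counts records of size bytes each rather than raw bytes, so with size = 8L a call with n = byte_chunk would consume 8 * byte_chunk bytes. The following is only a small sketch of that behaviour. It uses an in-memory raw connection instead of the zip file above, and the values are made up for illustration.

```r
# Build 4 little-endian doubles in memory (4 * 8 = 32 bytes).
payload <- writeBin(c(1.5, 2.5, 3.5, 4.5), raw(), size = 8L, endian = "little")

con <- rawConnection(payload)
# n = 2 asks for two doubles, i.e. 16 bytes, not 2 bytes.
vals <- readBin(con, "double", n = 2, size = 8L, endian = "little")
# Whatever is left on the connection, read as raw bytes.
rest <- readBin(con, "raw", n = 100)
close(con)

print(vals)
print(length(rest))
```

If the offsets are known in bytes, dividing by 8 gives the record count to pass as n for doubles.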

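In my use case, the readBin operation is part of a loop that gradually reads the whole binary file. However, I rarely want to read everything, and I usually know exactly which parts I want to read. Unfortunately, readBin has no start/skip argument for skipping the first n bytes. I therefore tried to conditionally replace readBin() with seek() so that the unwanted parts are skipped instead of actually being read.

When I try this, I get an error: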

> bin.con <- unz(zip_path, file_in_zip, open = 'rb')
> seek(bin.con, where = bytes_to_skip, origin = 'current')
Error in seek.connection(bin.con, where = bytes_to_skip, origin = "current") : 
  seek not enabled for this connection
> close(bin.con)

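So far I have not found a way around this error. Similar questions can be found here (unfortunately without satisfying answers):

https://stat.ethz.ch/pipermail/r-help/2007-December/148847.html (no answers)
http://r.789695.n4.nabble.com/reading-file-in-zip-archive-td4631853.html (no answers, but a reproducible example)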

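Tips around the internet suggest adding the open = 'r' argument to unz() or dropping the open argument altogether, but that only works for non-binary files (since the default is 'r'). People also suggest unzipping the files first, but since the files are quite big, this is practically impossible.

Is there any workaround for seeking in a zipped binary file, or for reading it from a byte offset (potentially using C++ via the Rcpp package)?

Update:

Further research seems to indicate that seek() in zip files is not an easy problem. This question suggests a C++ library that can at best do coarse seeking. This Python question indicates that exact seeking is completely impossible because of the way zip is implemented (although it does not contradict the coarse seeking method).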

Answer 1

Here's a bit of a hack that might work for you. Here's a fake binary file:

writeBin(as.raw(1:255), "file.bin")
readBin("file.bin", raw(1), n = 16)
#  [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10

And here is the resulting zip file:

zip("file.zip", "file.bin")
#   adding: file.bin (stored 0%)
readBin("file.zip", raw(1), n = 16)
#  [1] 50 4b 03 04 0a 00 02 00 00 00 7b ab 45 4a 87 1f

This uses a temporary intermediate binary file:

system('sh -c "unzip -p file.zip file.bin | dd of=tempfile.bin bs=1c skip=5c count=4c"')
# 4+0 records in
# 4+0 records out
# 4 bytes copied, 0.00044964 s, 8.9 kB/s
file.info("tempfile.bin")$size
# [1] 4
readBin("tempfile.bin", raw(1), n = 16)
# [1] 06 07 08 09

This method offloads the "expense" of dealing with the size of the stored binary data from R to the shell/pipe.
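If dd, unzip, and sh are not available, a forward-only skip can also be emulated inside R by reading the unwanted bytes from the connection in chunks and throwing them away. The sketch below is only an illustration: skip_bytes is a hypothetical helper, not part of base R, and it is demonstrated on an in-memory raw connection so that it runs without external tools. The assumption is that the same helper works on a connection opened with unz(..., open = 'rb'), since it relies on nothing but readBin.

```r
# Hypothetical helper: emulate a forward seek on a non-seekable connection
# by reading raw bytes in chunks and discarding them.
skip_bytes <- function(con, n_skip, chunk_size = 1e6) {
  remaining <- n_skip
  while (remaining > 0) {
    # Read at most chunk_size bytes so memory use stays bounded.
    got <- length(readBin(con, what = "raw", n = min(remaining, chunk_size)))
    if (got == 0L) stop("reached end of stream before skipping all bytes")
    remaining <- remaining - got
  }
  invisible(n_skip)
}

# Demo on the same fake data as above, held in memory instead of a zip file.
con <- rawConnection(as.raw(1:255))
skip_bytes(con, 5)                        # discard bytes 01..05
bytes <- readBin(con, what = "raw", n = 4)
close(con)
print(bytes)
# [1] 06 07 08 09
```

With the zip file from the question, the connection would presumably be created with unz(zip_path, file_in_zip, open = 'rb') instead of rawConnection(). The skipped bytes still have to be decompressed, so this does not make skipping free; it only avoids the temporary file and keeps at most chunk_size bytes in memory at a time.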

This works on win10, R-3.3.2. I'm using dd from Git for Windows (version 2.11.0.3, though 2.11.1 is available), and unzip and sh from RTools.

Sys.which(c("dd", "unzip", "sh"))
#                                    dd 
# "C:\\PROGRA~1\\Git\\usr\\bin\\dd.exe" 
#                                 unzip 
#          "c:\\Rtools\\bin\\unzip.exe" 
#                                    sh 
#             "c:\\Rtools\\bin\\sh.exe"

Reading a binary file in R from a zipped file at a known starting position (byte offset)
