Question

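I have been learning Python for 8 months and am new to R. I have a binary file that I can read and convert into a list (in Python, an array is a list). The data file (named test) is at:
https://www.box.com/s/0g3qg2lqgmr7y7fk5aut
The structure is: every 4 bytes is one integer, so in Python I read it with unpack: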

import struct

result = []
with open('test', 'rb') as datafile:
    # Each record is 32 bytes: eight 4-byte integers
    data = datafile.read(32)
    while data:
        result.append(list(struct.unpack('8i', data)))
        data = datafile.read(32)

How can I read this binary data in R?

With Paul Hiemstra's help, I got the code working in R:

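datafile = "test"
totalsize = file.info(datafile)$size   # file size in bytes
lines = totalsize / 32                 # one record = 8 integers of 4 bytes
# n counts integers, not bytes, so totalsize / 4 values are enough
data = readBin(datafile, integer(), n = totalsize / 4, size = 4, endian = "little")
result = data.frame(matrix(data, nrow = lines, ncol = 8, byrow = TRUE))
colnames(result) = c("date", "x1", "x2", "x3", "x4", "x5", "x6", "x7")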

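I still have an unresolved problem. Here I read all the data at once with n = totalsize. If the data is huge and does not fit in memory, how do I express "read the data from the 1001st to the 2000th byte"? With n = 1000 the data is read from the 1st to the 1000th, and with n = 2000 from the 1st to the 2000th; how do I read from the 1001st to the 2000th? Is there a file pointer in R? Once I have read the 1000th binary value, the file pointer is at position 1000; can I then use readBin("test", integer(), n = 1000, size = 4, endian = "little") to read the data from the 1001st to the 2000th?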

Answer 1

A Google search for "R read binary file" yields the following link as the first result. The bottom line is to use the readBin function, which in your case would look like:

file2read = file("test", "rb")
number_of_integers_in_file = 128
spam = readBin(file2read, integer(), number_of_integers_in_file, size = 4)
close(file2read)

If you don't know the number of integers in the file, there are several things you can do. First, create an example file:

# Create a binary file that we can read
l = as.integer(1:10)
file2write = file("/tmp/test", "wb")
writeBin(l, file2write)
close(file2write)

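One strategy is to overestimate the number of integers to read; readBin will only return the numbers that are actually there. A vector of size n is preallocated, so take care not to make n too large.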

file2read = file("/tmp/test", "rb")
l_read = readBin(file2read, integer(), n = 100)
close(file2read)
all.equal(l, l_read)
[1] TRUE

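Alternatively, if you know the size of the numbers, e.g. 4 bytes, you can work out how many are present using the following function I wrote: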

number_of_numbers = function(path, size = 4) {
  # If path is a file connection, extract file name
  if(inherits(path, "file")) path = summary(path)[["description"]]
  return(file.info(path)[["size"]] / size)
}
number_of_numbers("/tmp/test")
[1] 10

In action:

file2read = file("/tmp/test", "rb")
l_read2 = readBin(file2read, integer(), n = number_of_numbers(file2read))
close(file2read)
all.equal(l, l_read2)   
[1] TRUE

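If the amount of data is too large to fit in memory, I suggest reading it in chunks. This can be done with successive calls to readBin on the same open connection, for example: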

# Open the connection once so the read position persists between calls
con = file("test", "rb")
first_1000 = readBin(con, integer(), n = 1000)
next_1000 = readBin(con, integer(), n = 1000)
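
The successive calls can be wrapped in a loop that stops once readBin returns nothing. The following is only a minimal sketch: the temporary demo file, the chunk size of 1000, and the running sum (a stand-in for real processing) are assumptions made for illustration. It also shows why an open connection matters: when a file name is passed instead, readBin opens the file only for the duration of that call, so every call starts again at the beginning.

```r
# Hypothetical demo file: 2500 four-byte integers in a temporary file
path <- tempfile()
out <- file(path, "wb")
writeBin(as.integer(1:2500), out, size = 4)
close(out)

con <- file(path, "rb")
chunk_size <- 1000   # integers per chunk (assumed value, tune to your memory)
total <- 0           # running sum, standing in for real per-chunk processing
n_chunks <- 0
repeat {
  chunk <- readBin(con, integer(), n = chunk_size, size = 4)
  if (length(chunk) == 0) break   # nothing left to read
  total <- total + sum(chunk)
  n_chunks <- n_chunks + 1
}
close(con)

# With a file name instead of a connection, the file is reopened on each
# call, so this reads the first three integers again
again <- readBin(path, integer(), n = 3, size = 4)
```

Only one chunk is held in memory at a time; the last chunk is simply shorter than chunk_size.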

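If you want to skip part of the data file, say the first 1000 numbers, use the seek function. This is much faster than reading 1000 numbers, discarding them, and then reading the second 1000 numbers. For example: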

# Skip the first thousand 4 byte integers
seek(con, where = 4*1000)
next_1000 = readBin(con, integer(), n = 1000)
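
For the file layout in the question (records of eight 4-byte integers), seek and readBin can be combined to fetch an arbitrary range of records. This is a sketch under stated assumptions, not a definitive implementation: read_records, its arguments, and the temporary demo file are hypothetical names introduced here, and the file is assumed to be little-endian and to contain the whole requested range.

```r
# Hypothetical helper: read records first..last (1-based) as matrix rows
read_records <- function(path, first, last, ncol = 8, size = 4) {
  con <- file(path, "rb")
  on.exit(close(con))
  # Byte offset of record `first`: (first - 1) records of ncol * size bytes
  seek(con, where = (first - 1) * ncol * size, origin = "start")
  values <- readBin(con, integer(), n = (last - first + 1) * ncol,
                    size = size, endian = "little")
  matrix(values, ncol = ncol, byrow = TRUE)
}

# Demo file with 5 records holding the integers 1..40
path <- tempfile()
out <- file(path, "wb")
writeBin(as.integer(1:40), out, size = 4, endian = "little")
close(out)

# Records 2 and 3 hold 9..16 and 17..24
rows <- read_records(path, first = 2, last = 3)
```

Because the offset is computed in bytes from the start of the file, any block of records can be read without touching the data before it.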

Reading binary data in R instead of unpack in Python
