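When I try to load a data frame that was saved from pandas as an HDF5 file into R, I get this warning message:
Warning message: In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, : NAs produced by integer overflow while converting 64-bit integer or unsigned 32-bit integer from HDF5 to a 32-bit integer in R. Choose bit64conversion='bit64' or bit64conversion='double' to avoid data loss and see the vignette 'rhdf5' for more details about 64-bit integers.
For example, if I create an HDF5 file with pandas: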
import pandas as pd
frame = pd.DataFrame({
    'time': [1234567001, 1234515616515167005],
    'X2': [23.88, 23.96]
}, columns=['time', 'X2'])
store = pd.HDFStore('a.hdf5')
store['df'] = frame
store.close()
print(frame)
which prints:
time X2
0 1234567001 23.88
1 1234515616515167005 23.96
and then try to load it in R:
#source("http://bioconductor.org/biocLite.R")
#biocLite("rhdf5")
library(rhdf5)
loadhdf5data <- function(h5File) {
  # Function taken from [How can I load a data frame saved in pandas as an HDF5 file in R?](https://stackoverflow.com/a/45024089/395857)
  listing <- h5ls(h5File)
  # Find all data nodes, values are stored in *_values and corresponding column
  # titles in *_items
  data_nodes <- grep("_values", listing$name)
  name_nodes <- grep("_items", listing$name)
  data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
  name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
  columns = list()
  for (idx in seq(data_paths)) {
    print(idx)
    data <- data.frame(t(h5read(h5File, data_paths[idx])))
    names <- t(h5read(h5File, name_paths[idx], bit64conversion='bit64'))
    #names <- t(h5read(h5File, name_paths[idx], bit64conversion='double'))
    entry <- data.frame(data)
    colnames(entry) <- names
    columns <- append(columns, entry)
  }
  data <- data.frame(columns)
  return(data)
}
frame = loadhdf5data("a.hdf5")
I get this warning message:
> frame = loadhdf5data("a.hdf5")
[1] 1
[1] 2
Warning message:
In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, :
NAs produced by integer overflow while converting 64-bit integer or unsigned 32-bit integer from HDF5 to a 32-bit integer in R. Choose bit64conversion='bit64' or bit64conversion='double' to avoid data loss and see the vignette 'rhdf5' for more details about 64-bit integers.
I can see that one of the time values has become NA:
> frame
X2 time
1 23.88 1234567001
2 23.96 NA
How can I fix this? Choosing bit64conversion='bit64' or bit64conversion='double' does not change anything.
> R.version
_
platform x86_64-w64-mingw32
arch x86_64
os mingw32
system x86_64, mingw32
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
nickname You Stupid Darkness
Answer 0 (score: 1)
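The HDF5 Dataset Interface's documentation says:
bit64conversion: Defines how 64-bit integers are converted. Internally, R does not support 64-bit integers. All integers in R are 32-bit integers. By setting bit64conversion='int', a coercion to 32-bit integers is enforced, with the risk of data loss, but with the guarantee that numbers are represented as integers. bit64conversion='double' coerces the 64-bit integers to floating point numbers. Doubles can represent integers with up to 54 bits, but they are not represented as integer values anymore. For larger numbers there is again a data loss. bit64conversion='bit64' is the recommended way of coercing. It represents the 64-bit integers as objects of class 'integer64' as defined in the package 'bit64'. Make sure that you have installed 'bit64'. The datatype 'integer64' is not part of base R, but defined in an external package. This can produce unexpected behaviour when working with the data.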
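To make the three modes concrete for the two `time` values from the question, here is a rough Python sketch. It is plain arithmetic that only mimics the behaviour the documentation describes; it does not call rhdf5, and the helper names (`as_int32`, `survives_double`, `fits_int64`) are made up for illustration. Rather than assuming a bit threshold for doubles, it simply checks whether each value round-trips through a float.

```python
# Illustration only: plain-Python arithmetic that mimics the three
# bit64conversion modes described above. It does not call rhdf5.

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


def as_int32(value):
    """Mimic bit64conversion='int': out-of-range values become NA (None here)."""
    return value if INT32_MIN <= value <= INT32_MAX else None


def survives_double(value):
    """Mimic bit64conversion='double': True if the value round-trips through a float."""
    return int(float(value)) == value


def fits_int64(value):
    """Mimic bit64conversion='bit64': any signed 64-bit value is kept exactly."""
    return INT64_MIN <= value <= INT64_MAX


# The 'time' column from the question.
times = [1234567001, 1234515616515167005]

report = [
    {
        "value": t,
        "int32": as_int32(t),
        "double_exact": survives_double(t),
        "int64_ok": fits_int64(t),
    }
    for t in times
]

for row in report:
    print(row)
```

The first value fits in every representation. The second one is too large for a 32-bit integer and does not round-trip exactly through a double, so of the three modes only 'bit64' would keep it intact.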
So you should install bit64 (install.packages("bit64")) and load it (library(bit64)). You can check that integer64 is available:
> integer64
function (length = 0)
{
    ret <- double(length)
    oldClass(ret) <- "integer64"
    ret
}
<bytecode: 0x000000001a7a95f0>
<environment: namespace:bit64>
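Now you can run the following. Note that bit64conversion='bit64' is also passed to the h5read call that reads the data values, not only to the one that reads the column names: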
library(bit64)
library(rhdf5)
loadhdf5data <- function(h5File) {
  listing <- h5ls(h5File)
  # Find all data nodes, values are stored in *_values and corresponding column
  # titles in *_items
  data_nodes <- grep("_values", listing$name)
  name_nodes <- grep("_items", listing$name)
  data_paths = paste(listing$group[data_nodes], listing$name[data_nodes], sep = "/")
  name_paths = paste(listing$group[name_nodes], listing$name[name_nodes], sep = "/")
  columns = list()
  for (idx in seq(data_paths)) {
    print(idx)
    data <- data.frame(t(h5read(h5File, data_paths[idx], bit64conversion='bit64')))
    names <- t(h5read(h5File, name_paths[idx], bit64conversion='bit64'))
    entry <- data.frame(data)
    colnames(entry) <- names
    columns <- append(columns, entry)
  }
  data <- data.frame(columns)
  return(data)
}
frame = loadhdf5data("a.hdf5")
which gives:
> frame
X2 time
1 23.88 1234567001
2 23.96 1234515616515167005
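As a side note on why the loader iterates twice and returns X2 before time: the function pairs every `*_items` node with a `*_values` node. The Python sketch below imitates that selection on a list of node names. The names are an assumption about what h5ls() would list for this file (one block per dtype, as pandas' fixed format typically writes), not output copied from the question.

```python
# Hypothetical node names, modelled on what h5ls() might list for the
# two-column frame in the question (one float block, one integer block).
listing = [
    "axis0", "axis1",
    "block0_items", "block0_values",
    "block1_items", "block1_values",
]

# Same selection as the two grep() calls in loadhdf5data.
value_nodes = [name for name in listing if "_values" in name]
item_nodes = [name for name in listing if "_items" in name]

# Pair them positionally, as the for loop over seq(data_paths) does.
pairs = list(zip(item_nodes, value_nodes))
print(pairs)
```

Under that assumption there are two pairs, which matches the two idx values printed by the loop. Each block is read with its own h5read call, so the integer block only keeps its 64-bit values if bit64conversion='bit64' is set on the call that reads `*_values`.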