pandas read_json reads large integers incorrectly

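Asked: 2018-04-03 09:51:11

Tags: python json python-3.x pandas

I am trying to read tweets stored as a JSON file, and I am using pandas to load the data. However, I found some strange behavior in the read_json function. An MCVE follows: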

import pandas as pd

# Tweet ids are stored as strings; trailing commas removed so the JSON is valid
json_content="""
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
df=pd.read_json(json_content,
                orient='index', # read as transposed
                convert_axes=False, # don't convert keys to dates
        )
print(df.info())
print(df)

On my machine this prints the following:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2  10000000000000000
3  10000000000000000
4  10000000000000002
  

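The tid column does not hold the correct values. Why does this happen?

Note: this should not be an overflow case. The tid column is stored as int64, whose limit is roughly 10 times higher than the tid I originally tested (see below):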

import sys
# original problem
tid_0 = 956677215197970432
print(sys.maxsize,tid_0,sys.maxsize//tid_0)    # 0 if overflow were possible
# minimal case
tid = 10000000000000001
print(sys.maxsize,tid,sys.maxsize//tid)    # 0 if overflow were possible

#Output
9223372036854775807 956677215197970432 9
9223372036854775807 10000000000000001 922

Update

  

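The values are read correctly when the dtype=int argument is specified explicitly, but I don't understand why. What changes when we specify dtype?

1 answer:

Answer 0 (score: 1)

You can specify the dtype explicitly: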

In [32]: df=pd.read_json(json_content,
    ...:                 orient='index', # read as transposed
    ...:                 convert_axes=False, # don't convert keys to dates
    ...:                 dtype='int64'   # <------- NOTE
    ...:         )
    ...: print(df.info())
    ...: print(df)
    ...:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002
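
If passing dtype is not convenient, another option is to convert the ids before they reach pandas. The following is only a sketch using the standard library json module; the names records and tids are illustrative and not part of the original question. Because int() parses the digit string directly, no floating-point value is involved at any point.

```python
import json

json_content = """
{
    "1": {"tid": "9999999999999998"},
    "2": {"tid": "9999999999999999"},
    "3": {"tid": "10000000000000001"},
    "4": {"tid": "10000000000000002"}
}
"""

# Parse the JSON text into plain dicts; the tid values are still strings here.
records = json.loads(json_content)

# Convert each digit string straight to an arbitrary-precision Python int.
tids = {key: int(row["tid"]) for key, row in records.items()}

for key, tid in tids.items():
    print(key, tid)
```

The resulting integers all fit in int64, so they can be placed in a DataFrame afterwards without loss.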

It also works as expected if the JSON contains integers instead of string values:

In [61]: %paste
json_content="""
{
    "1": {
        "tid": 9999999999999998
    },
    "2": {
        "tid": 9999999999999999
    },
    "3": {
        "tid": 10000000000000001
    },
    "4": {
        "tid": 10000000000000002
    }
}
"""

df=pd.read_json(json_content,
                orient='index', # read as transposed
                convert_axes=False, # don't convert keys to dates
        )
print(df.dtypes)
print(df)

## -- End pasted text --
tid    int64
dtype: object
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002

So it looks like this is related to type inference, because the default is dtype=True, which means: "If True, infer dtypes".
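
The rounded values in the question are consistent with the string ids passing through float64 during inference. That is an assumption about the internals, not something the quoted documentation states. A float64 represents every integer exactly only up to 2**53 (9007199254740992), and the two odd ids lie above that limit. A minimal pure-Python sketch of such a round trip:

```python
# Assumption: dtype inference converts the string ids to float64 first.
# Python's float is a float64, so int(float(s)) mimics that round trip.
tids = ["9999999999999998", "9999999999999999",
        "10000000000000001", "10000000000000002"]

via_float = [int(float(s)) for s in tids]   # can lose precision above 2**53
direct = [int(s) for s in tids]             # exact integer parsing

for s, lossy, exact in zip(tids, via_float, direct):
    print(s, lossy, exact, lossy == exact)
```

The two odd ids both come back as 10000000000000000, matching the output shown in the question, while the even ids survive unchanged.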