Question

I am trying to read tweets stored in a JSON file, using pandas to load the data. I have found some strange behavior in the read_json function. Here is an MCVE:

import pandas as pd

json_content="""
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
df=pd.read_json(json_content,
                orient='index', # read as transposed
                convert_axes=False, # don't convert keys to dates
        )
print(df.info())
print(df)

On my machine this prints the following:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2  10000000000000000
3  10000000000000000
4  10000000000000002

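The tid column does not hold the correct values. Why does this happen?

Note: this should not be an overflow case. The tid column is stored as int64, whose limit is roughly 10 times higher than the tid I originally tested with (see below):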

import sys
# original problem
tid_0 = 956677215197970432
print(sys.maxsize, tid_0, sys.maxsize // tid_0)    # < 1 if overflow possible
# minimal case
tid = 10000000000000001
print(sys.maxsize, tid, sys.maxsize // tid)    # < 1 if overflow possible

#Output
9223372036854775807 956677215197970432 9
9223372036854775807 10000000000000001 922

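Update:

The values are read correctly when the parameter dtype=int is specified explicitly, but I don't understand why. What changes when we specify dtype?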

Answer 1

You can specify the dtype explicitly:

In [32]: df=pd.read_json(json_content,
    ...:                 orient='index', # read as transposed
    ...:                 convert_axes=False, # don't convert keys to dates
    ...:                 dtype='int64'   # <------- NOTE
    ...:         )
    ...: print(df.info())
    ...: print(df)
    ...:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002
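
The exact values can also be cross-checked without pandas. The following is a minimal sketch, assuming strictly valid JSON with the shape shown in the question; it parses with the standard-library json module and converts each tid string directly with int(), so no floating-point step is involved. The names records and tids are illustrative only.

```python
import json

# Same shape as the data in the question; the IDs are stored as strings.
json_content = """
{
    "1": {"tid": "9999999999999998"},
    "2": {"tid": "9999999999999999"},
    "3": {"tid": "10000000000000001"},
    "4": {"tid": "10000000000000002"}
}
"""

records = json.loads(json_content)

# int() parses the decimal string exactly; Python ints have arbitrary precision.
tids = {key: int(rec["tid"]) for key, rec in records.items()}

for key, tid in tids.items():
    print(key, tid)
```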

If we put integers instead of string values in the JSON, it also works as expected:

In [61]: %paste
json_content="""
{
    "1": {
        "tid": 9999999999999998
    },
    "2": {
        "tid": 9999999999999999
    },
    "3": {
        "tid": 10000000000000001
    },
    "4": {
        "tid": 10000000000000002
    }
}
"""

df=pd.read_json(json_content,
                orient='index', # read as transposed
                convert_axes=False, # don't convert keys to dates
        )
print(df.dtypes)
print(df)

## -- End pasted text --
tid    int64
dtype: object
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002

So it looks like this is related to type inference, because the default is dtype=True, which means: "If True, infer dtypes".
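
One plausible explanation for the wrong values is that the inference step first tries to parse the strings as float64 and only then converts the result to int64. This is an assumption about pandas internals, not something confirmed in this thread. A float64 has a 53-bit significand, so not every integer above 2**53 can be represented exactly. A minimal standard-library sketch of that round trip gives the same numbers as the output in the question:

```python
# Sketch: push the tweet IDs through a 64-bit float, as a float-first
# type inference would (assumed behavior, not taken from the pandas source).
tids = ["9999999999999998", "9999999999999999",
        "10000000000000001", "10000000000000002"]

exact = [int(s) for s in tids]             # string -> int, lossless
via_float = [int(float(s)) for s in tids]  # string -> float64 -> int

for s, e, f in zip(tids, exact, via_float):
    print(s, f, "ok" if e == f else "changed")

# Between 2**53 and 2**54 adjacent float64 values are 2 apart,
# so the odd IDs in this range are rounded to an even neighbor.
print(2**53)
```

This would also explain why the int64 limit is not the relevant bound here: the values fit in int64, but the two odd IDs do not survive an intermediate float64.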
