I am trying to read tweets stored as a JSON file. I am using pandas to load the data, but I found some strange behavior in the read_json function. Here is an MCVE:
import pandas as pd

json_content = """
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
df = pd.read_json(json_content,
                  orient='index',      # read as transposed
                  convert_axes=False,  # don't convert keys to dates
                  )
print(df.info())
print(df)
On my machine this prints the following:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid 4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
tid
1 9999999999999998
2 10000000000000000
3 10000000000000000
4 10000000000000002
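This does not hold the correct values for the tid column. Why does this happen?
Note: this should not be an overflow case. The tid column is stored as int64, whose limit is about 10 times higher than the tid I originally tested (see below):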
import sys
# original problem
tid_0 = 956677215197970432
print(sys.maxsize, tid_0, sys.maxsize // tid_0)  # 0 if overflow is possible
# minimal case
tid = 10000000000000001
print(sys.maxsize, tid, sys.maxsize // tid)  # 0 if overflow is possible
# Output
9223372036854775807 956677215197970432 9
9223372036854775807 10000000000000001 922
Update:
The values are read correctly when I explicitly pass dtype=int, but I don't understand why. What changes when we specify dtype?
Answer 0 (score: 1)
You can specify the dtype explicitly:
In [32]: df=pd.read_json(json_content,
...: orient='index', # read as transposed
...: convert_axes=False, # don't convert keys to dates
...: dtype='int64' # <------- NOTE
...: )
...: print(df.info())
...: print(df)
...:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid 4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
tid
1 9999999999999998
2 9999999999999999
3 10000000000000001
4 10000000000000002
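As a point of comparison, here is a minimal sketch that sidesteps type inference altogether by parsing with the standard-library json module and converting each string with int(). This is not part of the original answer; the variable names records and tids are chosen here for illustration, and the JSON is the question's data written without trailing commas.

```python
import json

json_content = """
{
  "1": {"tid": "9999999999999998"},
  "2": {"tid": "9999999999999999"},
  "3": {"tid": "10000000000000001"},
  "4": {"tid": "10000000000000002"}
}
"""

# Parse with the standard library; the tid values are still strings here.
records = json.loads(json_content)

# Convert each string straight to a Python int (arbitrary precision,
# so no intermediate floating-point step is involved).
tids = {key: int(row["tid"]) for key, row in records.items()}

for key, tid in tids.items():
    print(key, tid)
```

Because the strings never pass through a floating-point type, all four values should come out unchanged.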
If we specify integers instead of string values in the JSON, it also works as expected:
In [61]: %paste
json_content="""
{
    "1": {
        "tid": 9999999999999998
    },
    "2": {
        "tid": 9999999999999999
    },
    "3": {
        "tid": 10000000000000001
    },
    "4": {
        "tid": 10000000000000002
    }
}
"""
df=pd.read_json(json_content,
orient='index', # read as transposed
convert_axes=False, # don't convert keys to dates
)
print(df.dtypes)
print(df)
## -- End pasted text --
tid int64
dtype: object
tid
1 9999999999999998
2 9999999999999999
3 10000000000000001
4 10000000000000002
So it looks like this is related to type inference, because dtype defaults to True, which the documentation describes as: "If True, infer dtypes".
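One way to see why inference could corrupt these values: the rounding in the question is what you get when the numeric strings pass through float64 before being cast to int64. That inference path is an assumption about pandas internals, not something confirmed in this thread, but the arithmetic itself can be checked in plain Python, whose float is a 64-bit double. The names FLOAT64_EXACT_LIMIT and via_float_values are illustrative.

```python
# The four tid strings from the question.
tids = ["9999999999999998", "9999999999999999",
        "10000000000000001", "10000000000000002"]

# float64 has a 53-bit significand: every integer up to 2**53 is exact,
# but above that only some integers are representable.
FLOAT64_EXACT_LIMIT = 2 ** 53  # 9007199254740992

for text in tids:
    exact = int(text)             # direct string -> int, no precision loss
    via_float = int(float(text))  # assumed path: string -> float64 -> int
    print(text, exact > FLOAT64_EXACT_LIMIT, via_float, via_float == exact)

# Collect the float round-trip results for comparison with the question.
via_float_values = [int(float(t)) for t in tids]
```

All four values lie above 2**53, and the two odd ones collapse to 10000000000000000 after the round trip, which matches the output shown in the question. An int64 overflow check alone would not catch this, since the loss happens well below the int64 limit.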