Pandas: sparse DataFrame to dict without NaN values

Time: 2018-10-25 06:54:16

Tags: python pandas

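I have a sparse DataFrame, sdf, that contains mostly NaN. When I call sdf.to_dict(), it outputs a dense version of the matrix, with every null value filled in. How can I omit the NaN entries so that only the entries that actually hold a value end up in the dict?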

For example, sdf is:

          2018-02-02  2018-02-03
23:58:36         NaN         NaN
23:58:37         1.0         NaN
23:58:40         NaN         NaN
23:58:41         NaN         NaN
23:58:42         NaN         NaN
23:58:43         NaN         NaN
23:58:48         NaN         NaN
23:58:49         NaN         NaN
23:58:50         NaN         NaN
23:58:52         NaN         1.0
23:58:59         NaN         NaN
23:59:00         NaN         NaN
23:59:01         NaN         NaN
23:59:05         NaN         NaN
23:59:07         NaN         NaN

sdf.to_dict() gives:

{'2018-02-02': {'23:58:36': nan, '23:58:37': 1.0, '23:58:40':
  nan, '23:58:41': nan, '23:58:42': nan, '23:58:43': nan,
  '23:58:48': nan, '23:58:49': nan, '23:58:50': nan, '23:58:52':
  nan, '23:58:59': nan, '23:59:00': nan, '23:59:01': nan,
  '23:59:05': nan, '23:59:07': nan}, '2018-02-03': {'23:58:36':
  nan, '23:58:37': nan, '23:58:40': nan, '23:58:41': nan,
  '23:58:42': nan, '23:58:43': nan, '23:58:48': nan, '23:58:49':
  nan, '23:58:50': nan, '23:58:52': 1.0, '23:58:59': nan,
  '23:59:00': nan, '23:59:01': nan, '23:59:05': nan, '23:59:07':
  nan}}

even though sdf is a sparse DataFrame.


Sorry for being vague. I want to keep all non-NaN entries. The desired output is:

{'2018-02-02': {'23:58:37': 1.0}, '2018-02-03': {'23:58:52': 1.0}}

3 answers:

Answer 0: (score: 1)

Adapting this answer does exactly what you are asking for:

from math import isnan

sdd = sdf.dropna(how = 'all').to_dict()
clean_dict = {k: {j: sdd[k][j] for j in sdd[k] if not isnan(sdd[k][j])} for k in sdd}
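
The filtering step above can be tried without pandas. The following is a minimal sketch, assuming a nested dict shaped like the output of sdf.to_dict(); the sample data and the helper name drop_nan_entries are hypothetical and not part of the original answer. It also assumes every value is numeric, because math.isnan raises a TypeError on strings and other non-numeric values.

```python
from math import isnan, nan

# Hypothetical nested dict shaped like the output of sdf.to_dict():
# outer keys are column labels, inner keys are row labels.
sdd = {
    "2018-02-02": {"23:58:36": nan, "23:58:37": 1.0, "23:58:52": nan},
    "2018-02-03": {"23:58:36": nan, "23:58:37": nan, "23:58:52": 1.0},
}


def drop_nan_entries(nested):
    """Return a copy of a {column: {row: value}} dict without NaN values."""
    return {
        col: {row: val for row, val in rows.items() if not isnan(val)}
        for col, rows in nested.items()
    }


clean_dict = drop_nan_entries(sdd)
print(clean_dict)
# {'2018-02-02': {'23:58:37': 1.0}, '2018-02-03': {'23:58:52': 1.0}}
```

Iterating with .items() avoids the repeated sdd[k][j] lookups of the one-liner while producing the same result.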

Answer 1: (score: 1)

Use stack, which dropped NaN entries by default in pandas at the time of writing, and collect the result into a defaultdict:

from collections import defaultdict
d = defaultdict(dict)
for (k1, k2), v in df.stack().items():
    d[k2][k1] = v

d1 = dict(d)
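
As a runnable illustration of the stack approach, here is a minimal sketch on a hypothetical three-row frame standing in for the one in the question. The explicit dropna() call is an addition, not part of the original answer: whether stack() drops NaN on its own depends on the pandas version and the arguments passed, so dropping explicitly keeps the sketch independent of that.

```python
from collections import defaultdict

import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the one in the question
df = pd.DataFrame(
    {"2018-02-02": [np.nan, 1.0, np.nan], "2018-02-03": [np.nan, np.nan, 1.0]},
    index=["23:58:36", "23:58:37", "23:58:52"],
)

# stack() returns a Series indexed by (row label, column label) pairs.
# dropna() is called explicitly so the result does not rely on
# stack() removing NaN entries itself.
stacked = df.stack().dropna()

d = defaultdict(dict)
for (row_label, col_label), value in stacked.items():
    # Outer key is the column label, inner key is the row label
    d[col_label][row_label] = float(value)

result = dict(d)
print(result)
```

With this sample data, result should equal {'2018-02-02': {'23:58:37': 1.0}, '2018-02-03': {'23:58:52': 1.0}}, which matches the shape requested in the question.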

If the input is a Series with a DatetimeIndex:

print (s)
2018-02-02 23:58:37    1.0
2018-02-03 23:58:52    1.0
dtype: float64

from collections import defaultdict
d = defaultdict(dict)
# Iterate over the Series s itself; each key k is a Timestamp
for k, v in s.dropna().items():
    d[k.strftime('%Y-%m-%d')][k.strftime('%H:%M:%S')] = v

d1 = dict(d)

Answer 2: (score: 0)

So far, this has been the best approach for me.

from pandas import isnull

# Series.iteritems() was removed in pandas 2.0; items() is the replacement
[{k: i for k, i in row.items() if not isnull(i)} for _, row in df.iterrows()]
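
Note that the list comprehension above produces one dict per row and does not keep the row labels, so its shape differs from the nested dict requested in the question. The sketch below, on a hypothetical three-row frame, compares it with a column-wise variant that drops NaN in each column before converting. The column-wise variant is an assumption about what was wanted, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the one in the question
df = pd.DataFrame(
    {"2018-02-02": [np.nan, 1.0, np.nan], "2018-02-03": [np.nan, np.nan, 1.0]},
    index=["23:58:36", "23:58:37", "23:58:52"],
)

# Row-wise: one dict per row, keyed by column label; row labels are lost
per_row = [
    {k: v for k, v in row.items() if not pd.isnull(v)}
    for _, row in df.iterrows()
]

# Column-wise: drop NaN within each column, then convert what is left
per_column = {col: df[col].dropna().to_dict() for col in df.columns}

print(per_row)
print(per_column)
```

Here per_row holds an empty dict for every all-NaN row, while per_column is keyed by column and then by row label, as in the desired output.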