Pandas: using fillna with a DataFrame as the value parameter

Date: 2016-11-25 06:05:09

Tags: pandas

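I have a Pandas DataFrame with the fields program, dataset, algorithm, and result, where result is the runtime of a program running a specific algorithm on a specific dataset. Some results are missing. I want to fill in each missing result with the result that the reference program, program-A, obtained for the same algorithm and dataset.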

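I would welcome any suggestions on how to improve the code. My specific question, though, is why I cannot pass a DataFrame to the value parameter of fillna and instead have to convert it to a dict. (The documentation says value : scalar, dict, Series, or DataFrame.)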

import numpy
import pandas

col = ['program', 'dataset', 'algorithm', 'result']
df = pandas.DataFrame(
    [['program-A', 'dataset-X', 'algorithm-i', 1],
     ['program-A', 'dataset-X', 'algorithm-j', 2],
     ['program-A', 'dataset-Y', 'algorithm-i', 3],
     ['program-A', 'dataset-Y', 'algorithm-j', 4],
     ['program-B', 'dataset-X', 'algorithm-j', numpy.nan]
     ], columns=col)

df['algorithm_dataset'] = df['algorithm'] + "_" + df['dataset']

# build a dict from {algorithm+dataset} to result
dfg = df.loc[df['program'] == 'program-A'][['algorithm_dataset',
                                            'result']]
dfg = dfg.set_index('algorithm_dataset')
dfg_dict = dfg.to_dict()['result']

df = df.set_index('algorithm_dataset')
# df['result'] = df['result'].fillna(value=dfg)
# what's above doesn't work:
# ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>
# so instead:
df['result'] = df['result'].fillna(value=dfg_dict)
df = df.reset_index()

print(df)

Versions:

$ port installed | grep pandas
  py27-pandas @0.19.1_0 (active)
$ python --version
Python 2.7.12

1 answer:

Answer 0 (score: 1)

If you need to fill a single column (a Series), you can pass a Series to fillna instead of a dict:

ser = dfg.set_index('algorithm_dataset')['result']
print (ser)
algorithm_dataset
algorithm-i_dataset-X    1.0
algorithm-j_dataset-X    2.0
algorithm-i_dataset-Y    3.0
algorithm-j_dataset-Y    4.0
Name: result, dtype: float64

df = df.set_index('algorithm_dataset')
df['result1'] = df['result'].fillna(value=ser)
print (df)
                         program    dataset    algorithm  result  result1
algorithm_dataset                                                        
algorithm-i_dataset-X  program-A  dataset-X  algorithm-i     1.0      1.0
algorithm-j_dataset-X  program-A  dataset-X  algorithm-j     2.0      2.0
algorithm-i_dataset-Y  program-A  dataset-Y  algorithm-i     3.0      3.0
algorithm-j_dataset-Y  program-A  dataset-Y  algorithm-j     4.0      4.0
algorithm-j_dataset-X  program-B  dataset-X  algorithm-j     NaN      2.0

df['result'] = df['result'].fillna(value=ser)
print (df)
                         program    dataset    algorithm  result
algorithm_dataset                                               
algorithm-i_dataset-X  program-A  dataset-X  algorithm-i     1.0
algorithm-j_dataset-X  program-A  dataset-X  algorithm-j     2.0
algorithm-i_dataset-Y  program-A  dataset-Y  algorithm-i     3.0
algorithm-j_dataset-Y  program-A  dataset-Y  algorithm-j     4.0
algorithm-j_dataset-X  program-B  dataset-X  algorithm-j     2.0
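
The Series version works because fillna matches the fill values to the target by index label rather than by position. A minimal sketch of that behaviour, using shortened, hypothetical labels (i_X, j_X, k_X) that are not part of the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the label "j_X" appears twice, as in the question,
# and "k_X" has no counterpart in the reference Series.
target = pd.Series([1.0, 2.0, np.nan, np.nan],
                   index=["i_X", "j_X", "j_X", "k_X"], name="result")
reference = pd.Series({"i_X": 1.0, "j_X": 2.0})

# Each NaN is filled from the reference entry with the same index label.
filled = target.fillna(reference)
print(filled)
```

In this sketch the second "j_X" entry is filled from the reference, while "k_X" should stay NaN because the reference Series has no such label. Existing non-missing values are left untouched.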

If you need to pass a DataFrame to fillna, you first have to create another DataFrame with the same index and the same columns; then it works:

dfg = df.loc[df['program'] == 'program-A'][['algorithm_dataset',
                                            'result']]

dfg = dfg.set_index('algorithm_dataset')['result'].to_frame()
print (dfg)
                       result
algorithm_dataset            
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0

df = df.set_index('algorithm_dataset')
df = df.drop(['program','dataset','algorithm'], axis=1)
print (df)
                       result
algorithm_dataset            
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0
algorithm-j_dataset-X     NaN

dfg = dfg.reindex(df.index)
print (dfg)
                       result
algorithm_dataset            
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0
algorithm-j_dataset-X     2.0

df = df.fillna(dfg)
print (df)
                       result
algorithm_dataset            
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0
algorithm-j_dataset-X     2.0
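
To illustrate why the index and the columns both matter, here is a small self-contained sketch of DataFrame.fillna with a DataFrame argument. The frames, the column name other, and the labels a and b are hypothetical and chosen only for illustration. The index is kept unique here for simplicity, unlike the question's data, where the reindex step above takes care of the repeated label.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames, not the question's data.
df = pd.DataFrame({"result": [1.0, np.nan], "other": [np.nan, 5.0]},
                  index=["a", "b"])
fill = pd.DataFrame({"result": [10.0, 20.0]}, index=["a", "b"])

# Fill values are looked up by (index label, column name).
out = df.fillna(fill)
print(out)
```

Only the missing cell at row "b", column "result" finds a match in fill, so it becomes 20.0. The missing cell in "other" stays NaN because fill has no column of that name. This is the reason the answer drops the extra columns and reindexes dfg before calling fillna.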