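I have a Pandas DataFrame with program, dataset, algorithm, and result fields, where result is the runtime of the program for a particular algorithm and dataset. Some results are missing. I would like to fill in each missing result with the result for the same algorithm and dataset from a reference program, program-A.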
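I would welcome any suggestions on how to improve the code, but my specific question is why I cannot pass a DataFrame as the value argument of fillna and instead have to convert it to a dict. (The documentation says value : scalar, dict, Series, or DataFrame.)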
import numpy
import pandas

col = ['program', 'dataset', 'algorithm', 'result']
df = pandas.DataFrame(
    [['program-A', 'dataset-X', 'algorithm-i', 1],
     ['program-A', 'dataset-X', 'algorithm-j', 2],
     ['program-A', 'dataset-Y', 'algorithm-i', 3],
     ['program-A', 'dataset-Y', 'algorithm-j', 4],
     ['program-B', 'dataset-X', 'algorithm-j', numpy.nan]
    ], columns=col)
df['algorithm_dataset'] = df['algorithm'] + "_" + df['dataset']
# build a dict from {algorithm+dataset} to result
dfg = df.loc[df['program'] == 'program-A'][['algorithm_dataset',
                                            'result']]
dfg = dfg.set_index('algorithm_dataset')
dfg_dict = dfg.to_dict()['result']
df = df.set_index('algorithm_dataset')
# df['result'] = df['result'].fillna(value=dfg)
# what's above doesn't work:
# ValueError: invalid fill value with a <class 'pandas.core.frame.DataFrame'>
# so instead:
df['result'] = df['result'].fillna(value=dfg_dict)
df = df.reset_index()
print(df)
Versions:
$ port installed | grep pandas
py27-pandas @0.19.1_0 (active)
$ python --version
Python 2.7.12
Answer 0 (score: 1)
If you need a Series (or a dict), you can select the column to get a Series instead of a DataFrame, and pass that to fillna:
# dfg is the program-A subset with 'algorithm_dataset' still a regular column
ser = dfg.set_index('algorithm_dataset')['result']
print (ser)
algorithm_dataset
algorithm-i_dataset-X    1.0
algorithm-j_dataset-X    2.0
algorithm-i_dataset-Y    3.0
algorithm-j_dataset-Y    4.0
Name: result, dtype: float64
df = df.set_index('algorithm_dataset')
df['result1'] = df['result'].fillna(value=ser)
print (df)
                         program    dataset    algorithm  result  result1
algorithm_dataset
algorithm-i_dataset-X  program-A  dataset-X  algorithm-i     1.0      1.0
algorithm-j_dataset-X  program-A  dataset-X  algorithm-j     2.0      2.0
algorithm-i_dataset-Y  program-A  dataset-Y  algorithm-i     3.0      3.0
algorithm-j_dataset-Y  program-A  dataset-Y  algorithm-j     4.0      4.0
algorithm-j_dataset-X  program-B  dataset-X  algorithm-j     NaN      2.0
# alternatively, overwrite the result column instead of adding result1
df['result'] = df['result'].fillna(value=ser)
print (df)
                         program    dataset    algorithm  result
algorithm_dataset
algorithm-i_dataset-X  program-A  dataset-X  algorithm-i     1.0
algorithm-j_dataset-X  program-A  dataset-X  algorithm-j     2.0
algorithm-i_dataset-Y  program-A  dataset-Y  algorithm-i     3.0
algorithm-j_dataset-Y  program-A  dataset-Y  algorithm-j     4.0
algorithm-j_dataset-X  program-B  dataset-X  algorithm-j     2.0
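The index matching used above can be checked in isolation. The following is a minimal sketch with made-up labels (a, b) that are not from the original post. It shows Series.fillna accepting a Series or a dict and matching fill values by index label, while a DataFrame value is rejected. The exact exception type is an assumption that appears to differ between pandas versions (the question reports ValueError on 0.19.1), so the sketch catches both TypeError and ValueError.

```python
import numpy as np
import pandas as pd

# Target Series with a repeated label and one missing value (illustrative data).
target = pd.Series([1.0, 2.0, np.nan], index=["a", "b", "b"])

# Reference values keyed by the same labels.
reference = pd.Series({"a": 10.0, "b": 20.0})

# Fill values are matched by index label, not by position.
filled_from_series = target.fillna(reference)
filled_from_dict = target.fillna(reference.to_dict())

# A DataFrame is not accepted as the fill value of a Series.
try:
    target.fillna(reference.to_frame("result"))
    rejected = False
except (TypeError, ValueError):
    rejected = True

print(filled_from_series.tolist())
print(filled_from_dict.tolist())
print(rejected)
```

Only the missing entry is touched; existing values are left as they are even though the reference holds different numbers for the same labels.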
If you want to call fillna with a DataFrame, you first have to create another DataFrame with the same index and the same columns; then it works:
# start again from the original df, where 'algorithm_dataset' is still a regular column
dfg = df.loc[df['program'] == 'program-A'][['algorithm_dataset',
                                            'result']]
dfg = dfg.set_index('algorithm_dataset')['result'].to_frame()
print (dfg)
                       result
algorithm_dataset
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0
df = df.set_index('algorithm_dataset')
df = df.drop(['program','dataset','algorithm'], axis=1)
print (df)
                       result
algorithm_dataset
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0
algorithm-j_dataset-X     NaN
dfg = dfg.reindex(df.index)
print (dfg)
                       result
algorithm_dataset
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0
algorithm-j_dataset-X     2.0
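The reindex call is what gives the fill frame the same index as df, including the repeated label. A minimal sketch of that step on its own, with made-up labels (x, y) that are not from the post:

```python
import pandas as pd

# Reference frame with one row per unique label (illustrative data).
ref = pd.DataFrame({"result": [1.0, 2.0]}, index=["x", "y"])

# Target index in which the label "y" appears twice.
target_index = pd.Index(["x", "y", "y"])

# reindex looks up each target label in ref, so the "y" row is repeated.
aligned = ref.reindex(target_index)
print(aligned)
```

This relies on the labels in the reference frame being unique; only the target index contains repeats.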
df = df.fillna(dfg)
print (df)
                       result
algorithm_dataset
algorithm-i_dataset-X     1.0
algorithm-j_dataset-X     2.0
algorithm-i_dataset-Y     3.0
algorithm-j_dataset-Y     4.0
algorithm-j_dataset-X     2.0
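Since the question also asks for general suggestions, here is one possible alternative that is not part of the answer above: skip the concatenated algorithm_dataset key and look up the reference values with a left merge on the dataset and algorithm columns. This is only a sketch. The names keys, ref, merged and result_ref are made up for illustration, and it assumes program-A has at most one row per (dataset, algorithm) pair and that df has a default integer index.

```python
import numpy as np
import pandas as pd

col = ["program", "dataset", "algorithm", "result"]
df = pd.DataFrame(
    [["program-A", "dataset-X", "algorithm-i", 1],
     ["program-A", "dataset-X", "algorithm-j", 2],
     ["program-A", "dataset-Y", "algorithm-i", 3],
     ["program-A", "dataset-Y", "algorithm-j", 4],
     ["program-B", "dataset-X", "algorithm-j", np.nan]],
    columns=col,
)

keys = ["dataset", "algorithm"]

# Reference results from program-A, assumed unique per (dataset, algorithm).
ref = (
    df.loc[df["program"] == "program-A", keys + ["result"]]
    .rename(columns={"result": "result_ref"})
)

# A left merge keeps every row of df in its original order and places the
# reference value next to it; fillna then uses it only where result is NaN.
merged = df.merge(ref, on=keys, how="left")
df["result"] = merged["result"].fillna(merged["result_ref"])
print(df)
```

The merge result gets a fresh 0..n-1 index, which matches the default index of df here, so the assignment back to df lines up row by row.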