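I want to populate the columns of a DataFrame (df) by iterating over a list (A_list) and generating, for each item, a dictionary whose keys are the names of the columns I want in df (in the example below, the new columns are 'C', 'D' and 'E'). Note: I have no control over the output of gen_data. It returns a dictionary in which the keys are column names and the values are column values.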
import pandas

def gen_data(key):
    # Example function: the real one could be anything and is not necessarily related to other columns
    data_dict = {'C': key + key, 'D': key, 'E': key + key + key}
    return data_dict

A_list = ['a', 'b', 'c', 'd', 'f']
# 'B' needs six values to match the six rows of 'A'
df = pandas.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [1, 2, 3, 3, 2, 1]})

for A_value in A_list:
    data_dict = gen_data(A_value)
    for data_key in data_dict:
        # Fill each new column only on the rows whose 'A' matches the current value
        df.loc[df.A == A_value, data_key] = data_dict[data_key]
So the result should be:
from numpy import nan  # so the nan literals below are defined

df = pandas.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'],
                       'B': [1, 2, 3, 3, 2, 1],
                       'C': ['aa', 'bb', 'cc', 'dd', nan, 'ff'],
                       'D': ['a', 'b', 'c', 'd', nan, 'f'],
                       'E': ['aaa', 'bbb', 'ccc', 'ddd', nan, 'fff']})
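I feel that the inner loop

    for data_key in data_dict:
        df.loc[df.A == A_value, data_key] = data_dict[data_key]

is really inefficient when df has many rows, and I think there should be a way to remove the for loop from this code: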
for A_value in A_list:
    data_dict = gen_data(A_value)
    for data_key in data_dict:
        df.loc[df.A == A_value, data_key] = data_dict[data_key]
Answer 0 (score: 0)
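I experimented and found that replacing the inner for loop with the try/except block below sped the computation up by about a third. The except branch is there for the first iteration: it adds the new columns to the DataFrame, because otherwise the assignment fails with a mismatch error. It still feels inefficient, so I would welcome any feedback on how to improve it.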
import pandas

def gen_data(key):
    # Example function: the real one could be anything and is not necessarily related to other columns
    data_dict = {'C': key + key, 'D': key, 'E': key + key + key}
    return data_dict

A_list = ['a', 'b', 'c', 'd', 'f']
# 'B' needs six values to match the six rows of 'A'
df = pandas.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [1, 2, 3, 3, 2, 1]})

for A_value in A_list:
    data_dict = gen_data(A_value)
    try:
        # Write all generated values for the matching rows in one assignment
        df.loc[df.A == A_value] = df.loc[df.A == A_value].assign(**data_dict)
    except ValueError:
        # Hit on the first pass, when the new columns do not exist yet: add them (as NaN), then retry
        df = df.reindex(columns=df.columns.tolist() + list(data_dict.keys()))
        df.loc[df.A == A_value] = df.loc[df.A == A_value].assign(**data_dict)
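
If the goal is to stop masking df once per key, one option might be to collect the gen_data outputs into a small lookup frame first and then join it back on 'A' in a single step. This is only a sketch, assuming every dictionary returned by gen_data maps column names to scalar values. The names lookup and result are made up for illustration, and 'B' uses the six-value list from the expected result. gen_data is still called once per item in A_list, but df itself is touched only once.

```python
import pandas as pd

def gen_data(key):
    # Stand-in for the real function, whose output is outside our control
    return {'C': key + key, 'D': key, 'E': key + key + key}

A_list = ['a', 'b', 'c', 'd', 'f']
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'B': [1, 2, 3, 3, 2, 1]})

# One row of generated values per key, indexed by the key itself
lookup = pd.DataFrame.from_dict({key: gen_data(key) for key in A_list},
                                orient='index')

# Left join of column 'A' against lookup's index; unmatched rows get NaN
result = df.join(lookup, on='A')
print(result)
```

Rows whose 'A' value is not in A_list (here 'e') end up with NaN in the new columns, which is the same outcome the expected result in the question describes.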