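My table currently has more than 10,000,000 records and a column named ID. If an ID is in a given list, I want to update the column named '3rd_col' with a new value. I am using .loc; here is my code: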
for _id in given_ids:
    df.loc[df.ID == _id, '3rd_col'] = new_value
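But the code above is very slow. How can I speed up updating the values?
Sorry, let me be more specific about my problem: different IDs are assigned different values, computed by functions, and there are about 4 columns to assign.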
for _id in given_ids:
    # pass the loop variable _id, not the builtin id
    df.loc[df.ID == _id, '3rd_col'] = return_new_val_1(_id)
    df.loc[df.ID == _id, '4rd_col'] = return_new_val_2(_id)
    df.loc[df.ID == _id, '5rd_col'] = return_new_val_3(_id)
    df.loc[df.ID == _id, '6rd_col'] = return_new_val_4(_id)
Answer 0 (score: 5)
You can first create a dictionary and then use replace:
import pandas as pd

# sample function
def return_new_val(x):
    return x * 3

given_ids = list('abc')
# call the function once per ID instead of once per matching row
d = {_id: return_new_val(_id) for _id in given_ids}
print (d)
{'a': 'aaa', 'c': 'ccc', 'b': 'bbb'}
df = pd.DataFrame({'ID':list('abdefc'),
                   'M':[4,5,4,5,5,4]})
df['3rd_col'] = df['ID'].replace(d)
print (df)
  ID  M 3rd_col
0  a  4     aaa
1  b  5     bbb
2  d  4       d
3  e  5       e
4  f  5       f
5  c  4     ccc
Or use map, but then you get NaN for IDs with no match:
df['3rd_col'] = df['ID'].map(d)
print (df)
  ID  M 3rd_col
0  a  4     aaa
1  b  5     bbb
2  d  4     NaN
3  e  5     NaN
4  f  5     NaN
5  c  4     ccc
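If '3rd_col' already exists and the rows without a match should keep their current values (as in the question's .loc loop), one option is to fill the NaN values produced by map from the existing column. This is only a minimal sketch: the 'old' placeholder values and the hard-coded dictionary are assumptions made for illustration, not part of the original data.

```python
import pandas as pd

# Hypothetical sample data: '3rd_col' already holds values we want to keep
df = pd.DataFrame({'ID': list('abdefc'),
                   'M': [4, 5, 4, 5, 5, 4],
                   '3rd_col': ['old'] * 6})

# Lookup table built once per ID, as in the dictionary approach above
d = {'a': 'aaa', 'b': 'bbb', 'c': 'ccc'}

# map returns NaN for IDs missing from d;
# fillna then restores the existing value in those rows
df['3rd_col'] = df['ID'].map(d).fillna(df['3rd_col'])
print(df)
```

Both map and fillna work on the whole column at once, so the frame is not rescanned once per ID the way the loop in the question does.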
EDIT:
If you need to add data produced by multiple functions, first create a new DataFrame and then join it to the original:
def return_new_val1(x):
    return x * 2

def return_new_val2(x):
    return x * 3

given_ids = list('abc')
df2 = pd.DataFrame({'ID':given_ids})
df2['3rd_col'] = df2['ID'].map(return_new_val1)
df2['4rd_col'] = df2['ID'].map(return_new_val2)
df2 = df2.set_index('ID')
print (df2)
   3rd_col 4rd_col
ID
a       aa     aaa
b       bb     bbb
c       cc     ccc
df = pd.DataFrame({'ID':list('abdefc'),
                   'M':[4,5,4,5,5,4]})
df = df.join(df2, on='ID')
print (df)
  ID  M 3rd_col 4rd_col
0  a  4      aa     aaa
1  b  5      bb     bbb
2  d  4     NaN     NaN
3  e  5     NaN     NaN
4  f  5     NaN     NaN
5  c  4      cc     ccc
# but replace NaNs with the values in `ID`
cols = ['3rd_col','4rd_col']
df[cols] = df[cols].mask(df[cols].isnull(), df['ID'], axis=0)
print (df)
  ID  M 3rd_col 4rd_col
0  a  4      aa     aaa
1  b  5      bb     bbb
2  d  4       d       d
3  e  5       e       e
4  f  5       f       f
5  c  4      cc     ccc