I have 2 DataFrames.
What I need to do: for every customer account found in df_be (these accounts start out without a MAC address), look up the matching entries in df_sa and add to df_be all of the corresponding MAC addresses. The code below does the job, but it is extremely inefficient; I have been looking for a better approach, because this one does not scale to the amount of data I have.
Thank you very much!
import numpy as np

# Iterate through df_be with mac_stb = NaN
for i in np.arange(sum(df_be['mac_stb'].isnull())):
    # Get account number when mac_stb = NaN
    account = df_be[df_be['mac_stb'].isnull()].iloc[i]['id']
    # Copy row with mac_stb = NaN so that you can replace it later by the appropriate mac_stb
    new_df_be_row = df_be[df_be['mac_stb'].isnull()][df_be['id'] == account].copy()
    # Iterate through mac_stb associated with account id
    for mac in df_sa.loc[df_sa['sa'] == account]['mac_stb']:
        new_df_be_row['mac_stb'] = mac
        df_be = df_be.append(new_df_be_row, ignore_index=True)

# Drop rows with mac_stb = NaN as these have been replaced with possible mac_stb
df_be.dropna(subset=['mac_stb'], inplace=True)
df_be.reset_index(drop=True, inplace=True)
Original data (df_be):
timestamp time_ms id id_type mac_stb
720 2019-07-07 17:16:06.304 18.0 641269DD04B1 boxId 641269DD04B1
721 2019-07-07 17:16:06.291 9.0 98F7D7198F88 boxId 98F7D7198F88
722 2019-07-07 17:16:06.291 6.0 A0C5624B2D79 boxId A0C5624B2D79
723 2019-07-07 17:16:06.288 18.0 7085C6AAB849 device 7085C6AAB849
724 2019-07-07 17:16:06.304 18.0 S828093664 account NaN
725 2019-07-07 17:16:06.319 4.0 707630BC92E7 boxId 707630BC92E7
726 2019-07-07 17:16:06.319 8.0 S827336056 account NaN
727 2019-07-07 17:16:06.320 9.0 707630BC8FA8 device 707630BC8FA8
728 2019-07-07 17:16:06.340 9.0 S831286437 account NaN
729 2019-07-07 17:16:06.335 13.0 S841512815 account NaN
df_sa:
   mac_stb        mac_cm            sa
0 001E690D2C83 001E690D2C82 S827336056
1 001E690D2D8F 001E690D2D8E S831286437
2 001E690D311D 001E690D311C S841512815
3 001E690D4053 001E690D4052 S830161775
4 001E690D4B91 001E690D4B90 S825327910
IGNORE mac_cm
IMPORTANT UPDATE: there can be more than one mac_stb per account, so for each such request in df_be all possible mac_stb values must be shown; if there is more than one, a new row should be added for each additional (second/third) available mac_stb.
Example: this is df_be before mac_stb is populated (see the tables above).
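For reference, a minimal reconstruction of the two frames, limited to the columns the answer below actually uses, might look like this (the row subset and dtypes are my assumption, taken from the tables above):

import numpy as np
import pandas as pd

# Subset of df_be from the table above (only the columns used below).
df_be = pd.DataFrame({
    'id':      ['641269DD04B1', '7085C6AAB849', 'S828093664', 'S827336056', 'S831286437', 'S841512815'],
    'id_type': ['boxId',        'device',       'account',    'account',    'account',    'account'],
    'mac_stb': ['641269DD04B1', '7085C6AAB849', np.nan,       np.nan,       np.nan,       np.nan],
})

# Subset of df_sa (mac_cm is ignored, as noted above).
df_sa = pd.DataFrame({
    'mac_stb': ['001E690D2C83', '001E690D2D8F', '001E690D311D'],
    'sa':      ['S827336056',   'S831286437',   'S841512815'],
})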
Answer (score: 0)
You can use a merge. It may look fairly involved, but in essence it is very similar to your solution: the mask replaces your first loop and merge() replaces the second; the rest is just cleanup.
# Mask selects the rows that still need a mac_stb (replaces the first loop).
mask_no_mac_stb = df_be['mac_stb'].isnull()

# Merge those rows with df_sa on the account id (replaces the second loop),
# then keep df_sa's mac_stb under the original column name.
df_merged = (pd.merge(df_be[mask_no_mac_stb], df_sa, left_on='id', right_on='sa')
               .set_index('id')
               .drop('mac_stb_x', axis=1)
               .rename({'mac_stb_y': 'mac_stb'}, axis=1))

# Write the looked-up values back into df_be, matching on id.
df_be = df_be.set_index('id')
df_be.loc[df_merged.index, 'mac_stb'] = df_merged.loc[df_merged.index, 'mac_stb']
df_be = df_be.reset_index()
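Note that the .loc assignment above writes a single value per id, so if one account has several mac_stb entries in df_sa (the "important update" case in the question), only one of them would survive. One possible row-expanding variant, sketched under the assumption of the column names shown in the question (this is an addition, not part of the original answer):

mask_no_mac_stb = df_be['mac_stb'].isnull()

# Rows that already have a mac_stb are kept as they are.
resolved = df_be[~mask_no_mac_stb]

# A left merge keeps accounts that do not appear in df_sa (mac_stb stays NaN)
# and duplicates a row once per matching mac_stb when an account has several.
expanded = (df_be[mask_no_mac_stb]
              .drop(columns='mac_stb')
              .merge(df_sa[['sa', 'mac_stb']], how='left', left_on='id', right_on='sa')
              .drop(columns='sa'))

df_be = pd.concat([resolved, expanded], ignore_index=True)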
Update:
If you cannot guarantee that all ids are unique, you can use a dictionary and apply instead:
mask_no_mac_stb = df_be['mac_stb'].isnull()
map_id2mac = {id: mac for (id, mac) in zip(df_sa['sa'].to_list(), df_sa['mac_stb'].to_list())}
df_be.loc[mask_no_mac_stb, 'mac_stb'] = df_be.loc[mask_no_mac_stb, 'id'].apply(lambda x: map_id2mac.get(x, np.nan))
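The same lookup can also be written with Series.map, which returns NaN for missing keys on its own; a small sketch assuming the same frames as above:

# dict(zip(...)) keeps the last mac_stb seen for a duplicated account, like the comprehension above.
map_id2mac = dict(zip(df_sa['sa'], df_sa['mac_stb']))
mask_no_mac_stb = df_be['mac_stb'].isnull()
# .map with a dict yields NaN for ids that are not in df_sa.
df_be.loc[mask_no_mac_stb, 'mac_stb'] = df_be.loc[mask_no_mac_stb, 'id'].map(map_id2mac)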
Output for the original data provided (identical for both approaches):
id timestamp time_ms id_type mac_stb
0 641269DD04B1 17:16:06.304 18.00 boxId 641269DD04B1
1 98F7D7198F88 17:16:06.291 9.00 boxId 98F7D7198F88
2 A0C5624B2D79 17:16:06.291 6.00 boxId A0C5624B2D79
3 7085C6AAB849 17:16:06.288 18.00 device 7085C6AAB849
4 S828093664 17:16:06.304 18.00 account NaN
5 707630BC92E7 17:16:06.319 4.00 boxId 707630BC92E7
6 S827336056 17:16:06.319 8.00 account 001E690D2C83
7 707630BC8FA8 17:16:06.320 9.00 device 707630BC8FA8
8 S831286437 17:16:06.340 9.00 account 001E690D2D8F
9 S841512815 17:16:06.335 13.00 account 001E690D311D
Note: the mac_stb for id S828093664 is still NaN, because that account does not appear in df_sa.