我有一个 Python 字典如下:
ref_dict = {
"Company1" :["C1_Dev1","C1_Dev2","C1_Dev3","C1_Dev4","C1_Dev5",],
"Company2" :["C2_Dev1","C2_Dev2","C2_Dev3","C2_Dev4","C2_Dev5",],
"Company3" :["C3_Dev1","C3_Dev2","C3_Dev3","C3_Dev4","C3_Dev5",],
}
我有一个名为 df 的 Pandas 数据框,其中一列如下所示:
DESC_DETAIL
0 Probably task Company2 C2_Dev5
1 File system C3_Dev1
2 Weather subcutaneous Company2
3 Company1 Travesty C1_Dev3
4 Does not match anything
...........
我的目标是向此数据框中添加两个额外的列,并将这些列命名为 COMPANY 和 DEVICE。 COMPANY 列的每一行中的值要么是字典中的公司键,如果它存在于 DESC_DETAIL 列中,或者如果相应的设备存在于 >DESC_DETAIL 列。 DEVICE 列中的值将只是 DESC_DETAIL 列中的设备字符串。如果未找到匹配项,则对应的行为空。因此最终输出将如下所示:
DESC_DETAIL COMPANY DEVICE
0 Probably task Company2 C2_Dev5 Company2 C2_Dev5
1 File system C3_Dev1 Company3 C3_Dev1
2 Weather subcutaneous Company2 Company2 NaN
3 Company1 Travesty C1_Dev3 Company1 C1_Dev3
4 Does not match anything NaN NaN
我的尝试:
for key, value in ref_dict.items():
df['COMPANY'] = df.apply(lambda row: key if row['DESC_DETAIL'].isin(key) else Nan, axis=1)
这显然是错误的并且不起作用。我如何使它工作?
答案 0 :(得分:1)
您可以使用正则表达式模式使用 str.extract
提取值:
import re
s = pd.Series(ref_dict).explode()
# extract company
df['COMPANY'] = df['DESC_DETAIL'].str.extract(
f"({'|'.join(s.index.unique())})", flags=re.IGNORECASE)
# extract device
df['DEVICE'] = df['DESC_DETAIL'].str.extract(
f"({'|'.join(s)})", flags=re.IGNORECASE)
# fill missing company values based on device
df['COMPANY'] = df['COMPANY'].fillna(
df['DEVICE'].str.lower().map(dict(zip(s.str.lower(), s.index))))
df
输出:
DESC_DETAIL COMPANY DEVICE
0 Probably task Company2 C2_Dev5 Company2 C2_Dev5
1 File system C3_Dev1 Company3 C3_Dev1
2 Weather subcutaneous Company2 Company2 NaN
3 Company1 Travesty C1_Dev3 Company1 C1_Dev3
4 Does not match anything NaN NaN
答案 1 :(得分:0)
您还需要一个设备到公司字典,您可以从 ref_dict
轻松构建它,如下所示:
dev_to_company_dict = {v:l[0] for l in zip(ref_dict.keys(), ref_dict.values()) for v in l[1]}
然后很容易做到这一点:
df['COMPANY'] = df['DESC_DETAIL'].apply(lambda det : ''.join(set(re.split("\\s+", det)).intersection(ref_dict.keys())))
df['COMPANY'].replace('', np.nan, inplace=True)
df['DEVICE'] = df['DESC_DETAIL'].apply(lambda det : ''.join(set(re.split("\\s+", det)).intersection(dev_to_company_dict.keys())))
df['DEVICE'].replace('', np.nan, inplace=True)
df['COMPANY'] = df['COMPANY'].fillna(df['DEVICE'].map(dev_to_company_dict))
输出:
DESC_DETAIL COMPANY DEVICE
0 Probably task Company2 C2_Dev5 Company2 C2_Dev5
1 File system C3_Dev1 Company3 C3_Dev1
2 Weather subcutaneous Company2 Company2 NaN
3 Company1 Travesty C1_Dev3 Company1 C1_Dev3
4 Does not match anything NaN NaN