I want to classify parts using a DataFrame.
Here is a simplified version of the problem:
import pandas as pd

data = {'col1': ['engine','blue engine cover','spark plug',
'rear panel','black rear panel', 'blue engine']}
desc_df = pd.DataFrame(data=data)
catg = {'bodywork': ['engine cover','side panel','rear panel'],'underhood':['engine','spark plug','oil filter'],
'Glass':['Windscreen','window','demister']}
catg_df = pd.DataFrame(data=catg)
catg_df
        Glass      bodywork   underhood
0  Windscreen  engine cover      engine
1      window    side panel  spark plug
2    demister    rear panel  oil filter
desc_df
                col1
0             engine
1  blue engine cover
2         spark plug
3         rear panel
4   black rear panel
5        blue engine
I want to end up with:
                col1   Category
0             engine  underhood
1  blue engine cover   bodywork
2         spark plug  underhood
3         rear panel   bodywork
4   black rear panel   bodywork
5        blue engine  underhood
The closest I have got is:
d=catg_df.apply('|'.join).to_dict()
desc_df['Category'] = desc_df['col1'].apply(lambda x : ''.join([z if pd.Series(x).str.contains(y).values else '' for z,y in d.items()]))
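But when a string contains both "engine" and "engine cover", I end up with both categories:
desc_df
                col1           Category
0             engine          underhood
1  blue engine cover  bodyworkunderhood
2         spark plug          underhood
3         rear panel           bodywork
4   black rear panel           bodywork
5        blue engine          underhood
Is there a way to make it use the category for "engine cover" when it finds that first, instead of going on to match "engine" as well?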
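To see why the attempt above joins two category names, it can help to test each pattern against the problematic string on its own. The following is a minimal sketch using only the standard `re` module; the patterns are written out by hand to mirror what `catg_df.apply('|'.join).to_dict()` produces, and `matching_categories` is a hypothetical helper introduced for illustration.

```python
import re

# Patterns written out by hand, mirroring catg_df.apply('|'.join).to_dict()
d = {
    'bodywork': 'engine cover|side panel|rear panel',
    'underhood': 'engine|spark plug|oil filter',
    'Glass': 'Windscreen|window|demister',
}

def matching_categories(description):
    """Return every category whose pattern occurs somewhere in the description."""
    return [cat for cat, pattern in d.items() if re.search(pattern, description)]

# 'engine cover' and 'engine' both occur in this string, so two categories match
print(matching_categories('blue engine cover'))  # ['bodywork', 'underhood']
print(matching_categories('blue engine'))        # ['underhood']
```

Because both patterns match, `''.join(...)` concatenates both keys, which gives `bodyworkunderhood`.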
Answer 0 (score: 3)
One approach is to use get_close_matches from difflib to find the closest keyword, applied row by row with a lambda.
First, create a mapper from each keyword to its category:
from difflib import get_close_matches
mapper = {val:k for k, v in catg_df.to_dict('list').items() for val in v}
print(mapper)
The mapper then looks like this:
{'Windscreen': 'Glass',
 'demister': 'Glass',
 'engine': 'underhood',
 'engine cover': 'bodywork',
 'oil filter': 'underhood',
 'rear panel': 'bodywork',
 'side panel': 'bodywork',
 'spark plug': 'underhood',
 'window': 'Glass'}
Now use a lambda with get_close_matches to look up the closest keyword for each row:
# avoid calling mapper.keys() in lambda
keys = mapper.keys()
desc_df['Category'] = desc_df['col1'].apply(lambda row: mapper[get_close_matches(row, keys)[0]])
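The closest-match lookup can also be tried without pandas. Below is a minimal sketch that uses plain lists and the question's data; `categorize` and its `default` parameter are hypothetical names added here, not part of the answer. The fallback matters because `get_close_matches` returns an empty list when nothing reaches its similarity cutoff (0.6 by default), in which case indexing with `[0]` would raise an `IndexError`.

```python
from difflib import get_close_matches

catg = {
    'bodywork': ['engine cover', 'side panel', 'rear panel'],
    'underhood': ['engine', 'spark plug', 'oil filter'],
    'Glass': ['Windscreen', 'window', 'demister'],
}
descriptions = ['engine', 'blue engine cover', 'spark plug',
                'rear panel', 'black rear panel', 'blue engine']

# Invert the category table: keyword -> category
mapper = {keyword: category
          for category, keywords in catg.items()
          for keyword in keywords}
keys = list(mapper)

def categorize(description, default=None):
    """Return the category of the closest keyword, or `default` if none is close enough."""
    matches = get_close_matches(description, keys, n=1)
    return mapper[matches[0]] if matches else default

categories = [categorize(text) for text in descriptions]
print(categories)
```

With a DataFrame, the same helper could be passed to `desc_df['col1'].apply(categorize)`.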
Answer 1 (score: 1)
You can solve this by iterating over a dictionary:
from collections import OrderedDict
d = OrderedDict([(k, '|'.join(catg_df[k].tolist())) for k in catg_df.columns[::-1]])
for k, v in d.items():
    desc_df.loc[desc_df['col1'].str.contains(v), 'Category'] = k
Result
print(desc_df)
                col1   Category
0             engine  underhood
1  blue engine cover   bodywork
2         spark plug  underhood
3         rear panel   bodywork
4   black rear panel   bodywork
5        blue engine  underhood
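Explanation
Use str.contains with the regex values as the condition, and assign the key to the 'Category' column.
Use collections.OrderedDict to prioritise columns. Later assignments overwrite earlier ones, so the category processed last wins.
In this case, the iteration order of the columns is reversed in the construction of d, so that bodywork is assigned after underhood and overwrites it.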
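The overwrite-based priority can be made explicit rather than relying on column order. The sketch below is one way to do that, under two assumptions that are not in the answer: a hand-written `priority` list (lowest priority first), and `re.escape` applied to each keyword so that it is matched literally.

```python
import re
import pandas as pd

desc_df = pd.DataFrame({'col1': ['engine', 'blue engine cover', 'spark plug',
                                 'rear panel', 'black rear panel', 'blue engine']})
catg = {
    'bodywork': ['engine cover', 'side panel', 'rear panel'],
    'underhood': ['engine', 'spark plug', 'oil filter'],
    'Glass': ['Windscreen', 'window', 'demister'],
}

# Assumed ordering for this example: lowest priority first,
# because each later assignment overwrites the earlier ones.
priority = ['Glass', 'underhood', 'bodywork']

desc_df['Category'] = None  # rows that match nothing stay unclassified
for category in priority:
    # Escape every keyword, then join them into one alternation pattern
    pattern = '|'.join(re.escape(keyword) for keyword in catg[category])
    mask = desc_df['col1'].str.contains(pattern)
    desc_df.loc[mask, 'Category'] = category

print(desc_df)
```

Listing the priority by hand keeps the outcome independent of how the pandas version in use orders the columns of `catg_df`.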