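I have a pandas DataFrame with several columns and a dictionary whose values are lists. One column of the df holds a product description, and I need to check whether that description matches any of the values in the dictionary's lists.
Here is an excerpt of the dictionary: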
clothing_types = {'T-Shirt': ['t-shirt', 'shirt', 'tee'],
                  'Tank Top': ['tank top', 'mesh', 'top', 'tank'],
                  'Socks': ['socks'],
                  'Hat': ['cap'],
                  'Trainers': ['trainers', 'snickers', 'shoes', 'furylite contemporary']}
And this is the column:
0 UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP
1 UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS
2 UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS
3 UNDER ARMOUR LADIES PLAY UP SHORTS
4 REEBOK LADIES CLASSIC LEATHER MID TRAINERS
5 UNDER ARMOUR MENS Spring Performance Oxford SHIRT
6 UNDER ARMOUR LADIES HEATGEAR ALPHA SHORTY SHORTS
7 ADIDAS LADIES PRO TANK
8 REEBOK LADIES ONE SERIES V NECK T-SHIRT
9 REEBOK LADIES DF LONG BRA
10 NIKE LADIES BASELINE TENNIS SKIRT
11 UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS
12 UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP
I can do the comparison with a plain for loop:
for item in self.original_file['Product Description'].tolist():
    found = False
    for item_type, type_descriptions in clothing_types.items():
        for description in type_descriptions:
            if description.upper() in item.upper():
                # print(item_type, item)
                found = True
                break
    if not found:
        print('NOT FOUND', item)
I also tried using np.where:
for item_type, type_descriptions in clothing_types.items():
    for description in type_descriptions:
        self.original_file['Category'] = np.where(description.upper() in self.original_file['Product Description'], item_type, 'None')
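But this overwrites the column with the result of the last comparison, so the value in the column is always None.
What I expect is that when the description contains "SHIRT", the value "T-Shirt" (the dictionary key) ends up in the new column, Category.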
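A likely explanation for the np.where behaviour, sketched on two of the rows above: the `in` operator applied to a Series tests the index labels rather than the values, so the condition is a single False instead of a per-row mask, and every pass through the loop then overwrites the whole column. The small `df` below is a stand-in for `self.original_file`, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for self.original_file, using two rows from the column above
df = pd.DataFrame({'Product Description': [
    'ADIDAS LADIES PRO TANK',
    'REEBOK LADIES ONE SERIES V NECK T-SHIRT',
]})

# `in` on a Series looks at the index labels (0, 1), not at the strings
scalar_check = 'TANK' in df['Product Description']
label_check = 0 in df['Product Description']

# With a scalar condition, np.where yields one zero-dimensional result,
# which would be broadcast to every row when assigned to a column
result = np.where(scalar_check, 'Tank Top', 'None')

# An element-wise mask comes from the .str accessor instead
mask = df['Product Description'].str.contains('TANK', case=False, regex=False)

print(scalar_check, label_check, result.shape, mask.tolist())
```

Passing a mask like `mask` to np.where gives one decision per row, although a loop that reassigns the column for every keyword would still keep only the last assignment.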
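Answer 0: (score: 0)
We can use str.contains to check whether any of the keywords match. Where there is a match, we fill in the dictionary key; otherwise we leave an empty string. Finally, we concatenate the empty strings and the matches across all keys into a single column: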
matches = [np.where(df['Product Description'].str.contains('|'.join(v), case=False),
                    k,
                    '') for k, v in clothing_types.items()]
matches_df = pd.DataFrame(matches).T.sum(axis=1).to_frame('Matches')
df = df.join(matches_df)
Output
Product Description Matches
0 UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP Tank Top
1 UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS Socks
2 UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS Socks
3 UNDER ARMOUR LADIES PLAY UP SHORTS
4 REEBOK LADIES CLASSIC LEATHER MID TRAINERS Trainers
5 UNDER ARMOUR MENS Spring Performance Oxford SHIRT T-Shirt
6 UNDER ARMOUR LADIES HEATGEAR ALPHA SHORTY SHORTS
7 ADIDAS LADIES PRO TANK Tank Top
8 REEBOK LADIES ONE SERIES V NECK T-SHIRT T-Shirt
9 REEBOK LADIES DF LONG BRA
10 NIKE LADIES BASELINE TENNIS SKIRT
11 UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS Hat
12 UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP Tank Top
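Note that row 11 is labelled Hat, because "cap" is a substring of "ESCAPE". One possible way around this, offered as a sketch rather than a definitive fix, is to wrap each keyword list in regex word boundaries and let np.select pick the first matching key. The helper `keyword_pattern` and the shortened dictionary and DataFrame are assumptions made for illustration.

```python
import re
import numpy as np
import pandas as pd

clothing_types = {
    'T-Shirt': ['t-shirt', 'shirt', 'tee'],
    'Tank Top': ['tank top', 'mesh', 'top', 'tank'],
    'Socks': ['socks'],
    'Hat': ['cap'],
}

df = pd.DataFrame({'Product Description': [
    'UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS',
    'ADIDAS LADIES PRO TANK',
    'REEBOK LADIES ONE SERIES V NECK T-SHIRT',
]})

def keyword_pattern(keywords):
    """Hypothetical helper: match any keyword, but only as a whole word."""
    alternatives = '|'.join(re.escape(k) for k in keywords)
    return rf'\b(?:{alternatives})\b'

# One boolean mask per dictionary key, in dictionary order
conditions = [
    df['Product Description']
    .str.contains(keyword_pattern(v), case=False, regex=True)
    .to_numpy()
    for v in clothing_types.values()
]

# np.select takes the first key whose mask is True, else the default
df['Category'] = np.select(conditions, list(clothing_types.keys()), default='')
print(df['Category'].tolist())
```

Because np.select keeps only the first match, a description that hits several keys gets one category instead of a concatenation of all of them; whether that is preferable depends on the data.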
Answer 1: (score: 0)
This works, but I'm not sure whether it is the best solution:
for i in self.original_file.index:
    for item_type, type_descriptions in clothing_types.items():
        for description in type_descriptions:
            # look rows up by label with .at, since i comes from the index
            if description.upper() in self.original_file.at[i, 'Product Description'].upper():
                self.original_file.at[i, 'Category'] = item_type
Answer 2: (score: 0)
First, you should invert the clothing types dictionary so that each keyword maps to its category, like this:
import itertools

clothing_types2 = dict(list(itertools.chain(*[[(y_, x) for y_ in y] for x, y in clothing_types.items()])))
Then, create a function that checks, row by row, whether any word of the description is in the newly created dictionary:
def to_category(x):
    for w in x.lower().split(" "):
        if w in clothing_types2:
            return clothing_types2[w]
    return None
Finally, apply the function to the description column (named "clothing" here) and save the result to a new column:
df["Category"] = df["clothing"].apply(to_category)
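For reference, the same inversion can also be written as a dict comprehension. The sketch below uses illustrative names (`keyword_to_type`, and a shortened dictionary) and shows one limitation of the word-by-word lookup: a multi-word keyword such as 'tank top' can never equal a single word, so it only matches here because 'tank' and 'top' are also listed on their own.

```python
clothing_types = {
    'T-Shirt': ['t-shirt', 'shirt', 'tee'],
    'Tank Top': ['tank top', 'mesh', 'top', 'tank'],
    'Socks': ['socks'],
    'Hat': ['cap'],
}

# Invert the mapping: each keyword points back to its category
keyword_to_type = {
    keyword: item_type
    for item_type, keywords in clothing_types.items()
    for keyword in keywords
}

def to_category(description):
    """Return the category of the first word that is a known keyword."""
    for word in description.lower().split():
        if word in keyword_to_type:
            return keyword_to_type[word]
    return None

print(to_category('ADIDAS LADIES PRO TANK'))
print(to_category('REEBOK LADIES ONE SERIES V NECK T-SHIRT'))
print(to_category('UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS'))
```

Since whole words are compared, "ESCAPE" does not trigger the 'cap' keyword in this approach.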