My DataFrame has a product title column:
product_title price
Touch Pen For Stylus Apple Pencil iPad iPhone Pro Max For Samsung Smartphone Tablet 10.65
5 RCA Ypbpr component to HDMI HDTV video audio converter adapter with power supply 20
FDGAO 4 in 1 Wireless Charging Stand For Apple Watch 5 4 3 2 1 iPhone 11 X XS XR 8 100
I want to classify each product (each row) into a category (e.g. laptop, smartphone, ...).
I wrote this function to use with pandas apply:
def cat(d):
    # Lower-case once; assumes the column is named product_title, as in the sample above
    title = d['product_title'].lower()
    if "3d" in title:
        return "3d"
    if "tablet" in title:
        return "tablet"
    if "surveillance" in title:
        return "surveillance"
    if "projector" in title:
        return "projector"
    if any(x in title for x in ['dvb', 'satellite']):
        return "satellite"
    #...
    else:
        # Note: this is the string "NaN", not a real missing value
        return "NaN"

df['category'] = df.apply(cat, axis=1)
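The problem is that some product titles contain keywords from several categories, including categories the product does not belong to. For example, this title: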
Air 3 TWS Bluetooth Earphone Pro Wireless Headphones with Mic In-ear Stereo Earbuds Sports Gaming Headset for all Smartphone
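This title contains earphone, headphones and headset, so it could be classified under phone accessories. The problem is that it also contains smartphone, and there is a separate category for smartphones (the devices, not accessories). For example, this title is for an actual smartphone: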
128GB Smartphone Global Version 48MP dual caemra Mobile Phone 4000mAh Battery 6.59inch
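One way I could imagine handling the overlap is to check rules in a fixed order, with accessory keywords tested before device keywords, so the first match wins. This is only a minimal sketch: the RULES list, its keywords, and the categorize helper are hypothetical names I made up for illustration, not something taken from the dataset.

```python
import pandas as pd

# Hypothetical rule list: order matters, the first matching rule wins.
# Accessory keywords come before device keywords so that a title like
# "... Headset for all Smartphone" is not labelled as a smartphone.
RULES = [
    ("phone accessories", ["earphone", "headphone", "headset", "earbuds"]),
    ("smartphone", ["smartphone", "mobile phone"]),
    ("tablet", ["tablet"]),
]

def categorize(title):
    """Return the first category whose keywords appear in the title, else None."""
    text = title.lower()
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return None

df = pd.DataFrame({"product_title": [
    "Air 3 TWS Bluetooth Earphone Pro Wireless Headphones with Mic In-ear "
    "Stereo Earbuds Sports Gaming Headset for all Smartphone",
    "128GB Smartphone Global Version 48MP dual caemra Mobile Phone "
    "4000mAh Battery 6.59inch",
]})

# Apply to the single column instead of row-wise over the whole frame.
df["category"] = df["product_title"].apply(categorize)
```

With this ordering the earphone title lands in the accessories category and the 128GB title in smartphone, but the result depends entirely on how the rules are ordered, so it would need tuning against real data.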
When I filter with .loc, it always returns an empty result. Here is a sample of the dataset: https://docs.google.com/spreadsheets/d/1oTHP3JU7FlK_wAye5KYSelWqziXRSKBBeXKrLWkeaFA/edit?usp=sharing
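For reference, this is roughly the kind of filter I mean. I am not sure what causes the empty result in my case; one possibility is that an exact comparison only matches when the whole cell equals the keyword, whereas a substring match with str.contains does not have that restriction. The small frame below is made up from the sample rows above.

```python
import pandas as pd

df = pd.DataFrame({
    "product_title": [
        "Touch Pen For Stylus Apple Pencil iPad iPhone Pro Max For Samsung Smartphone Tablet",
        "5 RCA Ypbpr component to HDMI HDTV video audio converter adapter with power supply",
    ],
    "price": [10.65, 20],
})

# Case-insensitive substring match; na=False treats missing titles as non-matches.
mask = df["product_title"].str.contains("tablet", case=False, na=False)
tablets = df.loc[mask]

# An exact comparison matches only when the entire cell equals the string,
# so it selects nothing here.
exact = df.loc[df["product_title"] == "tablet"]
```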
I also don't know how to find all possible categories using the whole dataset.
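The only idea I have so far is to count how many titles each word appears in and read through the most frequent ones as candidate category keywords. This is a rough sketch, not a real solution: the STOPWORDS set is a hypothetical hand-picked list, and the three rows stand in for the full dataset.

```python
from collections import Counter
import re

import pandas as pd

df = pd.DataFrame({"product_title": [
    "Touch Pen For Stylus Apple Pencil iPad iPhone Pro Max For Samsung Smartphone Tablet",
    "5 RCA Ypbpr component to HDMI HDTV video audio converter adapter with power supply",
    "FDGAO 4 in 1 Wireless Charging Stand For Apple Watch 5 4 3 2 1 iPhone 11 X XS XR 8",
]})

# Hypothetical stop-word list; extend it after looking at the real output.
STOPWORDS = {"for", "with", "to", "in", "and", "the", "of"}

counts = Counter()
for title in df["product_title"].dropna():
    # Use a set so each word is counted at most once per title.
    tokens = set(re.findall(r"[a-z]+", title.lower()))
    counts.update(tokens - STOPWORDS)

# Most frequent words across titles, to be reviewed by hand.
top_words = counts.most_common(20)
```

Words that show up in many titles (here "apple" and "iphone" each appear in two) would still have to be grouped into categories manually.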