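I have a dataset in which a few columns hold lists of items. An example is given below. I am trying to find the entries whose lists match 100%. I would also like to find the ones that match 90% or less.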
>>> df2 = pd.DataFrame({ 'ID':['1', '2', '3', '4', '5', '6', '7', '8'], 'Productdetailed': [['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'], ['Apple', 'Mango', 'Orange'], ['Something', 'Nothing', 'Everything'], ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],['Apple', 'Mango'], ['Pen', 'Phone', 'Watch']]})
>>> df2
ID Productdetailed
0 1 [Phone, Watch, Pen]
1 2 [Pencil, fork, Eraser]
2 3 [Apple, Mango, Orange]
3 4 [Something, Nothing, Everything]
4 5 [Eraser, fork, Pencil]
5 6 [Phone, Watch, Pen]
6 7 [Apple, Mango]
7 8 [Pen, Phone, Watch]
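If you look at index 0 and index 7 in df2, they contain the same set of items in a different order, whereas index 0 and index 5 contain the same items in the same order. I want to treat both cases as matches. I tried groupby and series.isin(). I also tried an intersection after splitting the dataset in two, but that failed with a type error.
First, I want to count the exact matches (the number of matching rows) together with the row index numbers they match. When the items match only partially, for example index 2 and index 6 in df2, I want to report the percentage of items that matched and the row numbers they correspond to.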
I tried splitting the data into two parts on a particular column value and then applied
df2['Intersection'] = [
    list(set(a).intersection(set(b)))
    for a, b in zip(df2_part1.Productdetailed, df2_part2.Productdetailed)
]
where a and b are the Productdetailed values taken from the two pieces, df2_part1 and df2_part2.
Is there a way to do this? Please help.
Answer 0 (score: 2)
To find the exact matches:
# Sort the items inside each list so that item order does not matter
df2["Productdetailed"] = df2["Productdetailed"].apply(sorted)
# Create a new column from the sorted list; a string is easier to use in a pivot table
df2['Productdetailed_str'] = df2['Productdetailed'].apply(lambda x: ', '.join(x))
df2["hit"] = 1
df3 = (df2.pivot_table(index=["Productdetailed_str"],
                       values=["ID", "hit"],
                       aggfunc={'ID': lambda x: ', '.join(x), 'hit': 'sum'}
                       ))
hit is the number of occurrences. The resulting df3:
ID hit
Productdetailed_str
Apple, Mango 7 1
Apple, Mango, Orange 3 1
Eraser, Pencil, fork 2, 5 2
Everything, Nothing, Something 4 1
Pen, Phone, Watch 1, 6, 8 3
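If you want the matching row indices rather than the joined IDs, the same idea can be sketched with a plain groupby. This is only an illustration, not part of the answer above: the names key, rows_by_key and exact_matches are made up for the example, and it assumes no item contains the ', ' separator.

```python
import pandas as pd

df2 = pd.DataFrame({
    'ID': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'Productdetailed': [
        ['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'],
        ['Apple', 'Mango', 'Orange'], ['Something', 'Nothing', 'Everything'],
        ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],
        ['Apple', 'Mango'], ['Pen', 'Phone', 'Watch'],
    ],
})

# Order-insensitive key: sort each list, then join it into one string.
key = df2['Productdetailed'].apply(lambda items: ', '.join(sorted(items)))

# Collect the row indices that share each key.
rows_by_key = df2.index.to_series().groupby(key).agg(list)

# Keep only the item sets that occur in more than one row.
exact_matches = rows_by_key[rows_by_key.apply(len) > 1]
print(exact_matches)
```

The number of rows involved in an exact match is then the total length of the lists in exact_matches.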
Partial matches are harder, but you can start by splitting the lists into rows and experimenting with a pivot table:
test = df2.apply(lambda x: pd.Series(x['Productdetailed']),axis=1).stack().reset_index(level=1, drop=True).to_frame(name='list').join(df2)
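If you run test, the 'list' column holds the individual words from the lists in the Productdetailed column, and each word keeps its ID ... so I think you can extract the information from there with a pivot table.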
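One way to continue from the split rows, sketched here under assumptions: it uses Series.explode (pandas 0.25 or newer) instead of the apply/stack line, a self-merge instead of a pivot table, and it defines the percentage as shared items divided by the size of the larger list. That definition and the names long, pairs and shared are choices made for this example only.

```python
import pandas as pd

df2 = pd.DataFrame({
    'ID': ['1', '2', '3', '4', '5', '6', '7', '8'],
    'Productdetailed': [
        ['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'],
        ['Apple', 'Mango', 'Orange'], ['Something', 'Nothing', 'Everything'],
        ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],
        ['Apple', 'Mango'], ['Pen', 'Phone', 'Watch'],
    ],
})

# One row per (row index, item); duplicates inside a list are counted once.
long = (df2['Productdetailed'].explode()
        .rename('item').rename_axis('row').reset_index()
        .drop_duplicates())

# Self-join on the item: every pair of rows that shares at least one item.
pairs = long.merge(long, on='item', suffixes=('_a', '_b'))
pairs = pairs[pairs['row_a'] < pairs['row_b']]

# Number of shared items for each pair of rows.
shared = pairs.groupby(['row_a', 'row_b']).size().rename('shared').reset_index()

# Percentage relative to the larger of the two lists (assumed definition).
sizes = df2['Productdetailed'].apply(lambda items: len(set(items)))
larger = pd.concat([shared['row_a'].map(sizes),
                    shared['row_b'].map(sizes)], axis=1).max(axis=1)
shared['percent'] = 100 * shared['shared'] / larger
print(shared)
```

Pairs with no item in common never appear in shared, so only overlapping rows are reported.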
Answer 1 (score: 1)
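This solution handles the exact-match task (the code compares every pair of rows, so its complexity is high and it is not recommended for large data). The first block below finds exact matches only. The second block handles both exact and partial matches, where a pair counts as a partial match if at least two values match (the threshold can be changed):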
import numpy as np

#First create a dummy column of Productdetailed which is sorted
df2['dummy'] = df2['Productdetailed'].apply(sorted)
#Create Matching column which stores index of first matched list
df2['Matching'] = np.nan
#Code for finding the exact matches and assigning indices in Matching column
for index1,lst1 in enumerate(df2['dummy']):
    for index2,lst2 in enumerate(df2['dummy']):
        if index1<index2:
            if (lst1 == lst2):
                if np.isnan(df2.loc[index2,'Matching']):
                    df2.loc[index1,'Matching'] = index1
                    df2.loc[index2,'Matching'] = index1
#Finding the sum of total exact matches
print(df2['Matching'].notnull().sum())
5
#Deleting the dummy column
del df2['dummy']
#Final Dataframe
print(df2)
ID Productdetailed Matching
0 1 [Phone, Watch, Pen] 0.0
1 2 [Pencil, fork, Eraser] 1.0
2 3 [Apple, Mango, Orange] NaN
3 4 [Something, Nothing, Everything] NaN
4 5 [Eraser, fork, Pencil] 1.0
5 6 [Phone, Watch, Pen] 0.0
6 7 [Apple, Mango] NaN
7 8 [Pen, Phone, Watch] 0.0
#Second version: exact and partial matches
#First create a dummy column of Productdetailed which is sorted
df2['dummy'] = df2['Productdetailed'].apply(sorted)
#Create Matching column which stores index of first matched list
df2['Matching'] = np.nan
#Create Column Stating Status of Matching
df2['Status'] = 'No Match'
#Code for finding the exact and partial matches and assigning indices in Matching column
for index1,lst1 in enumerate(df2['dummy']):
    for index2,lst2 in enumerate(df2['dummy']):
        if index1<index2:
            if (lst1 == lst2):
                if np.isnan(df2.loc[index2,'Matching']):
                    df2.loc[index1,'Matching'] = index1
                    df2.loc[index2,'Matching'] = index1
                    df2.loc[[index1,index2],'Status'] = 'Fully Matched'
            else:
                #Count the values the two lists have in common
                count = sum([1 for v1 in lst1 for v2 in lst2 if v1==v2])
                if count>=2:
                    if np.isnan(df2.loc[index2,'Matching']):
                        df2.loc[index1,'Matching'] = index1
                        df2.loc[index2,'Matching'] = index1
                        df2.loc[[index1,index2],'Status'] = 'Partially Matched'
#Finding the sum of total exact and partial matches
print(df2['Matching'].notnull().sum())
7
#Deleting the dummy column
del df2['dummy']
#Final Dataframe
print(df2)
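If a percentage is preferred over the fixed "at least two values" rule, the pairwise comparison could be sketched as follows. This is an assumption-laden illustration rather than the answer's code: overlap_percent, threshold and matches are hypothetical names, the percentage is taken relative to the larger of the two sets, and every list is assumed to be non-empty.

```python
from itertools import combinations

# Same lists as df2['Productdetailed'].tolist() in the question.
product_lists = [
    ['Phone', 'Watch', 'Pen'], ['Pencil', 'fork', 'Eraser'],
    ['Apple', 'Mango', 'Orange'], ['Something', 'Nothing', 'Everything'],
    ['Eraser', 'fork', 'Pencil'], ['Phone', 'Watch', 'Pen'],
    ['Apple', 'Mango'], ['Pen', 'Phone', 'Watch'],
]

def overlap_percent(a, b):
    """Shared items as a percentage of the larger set (assumed definition)."""
    set_a, set_b = set(a), set(b)
    return 100 * len(set_a & set_b) / max(len(set_a), len(set_b))

threshold = 50  # hypothetical cut-off; 100 keeps exact matches only

# Compare each pair of rows once and keep the pairs at or above the cut-off.
matches = [
    (i, j, round(overlap_percent(a, b), 1))
    for (i, a), (j, b) in combinations(enumerate(product_lists), 2)
    if overlap_percent(a, b) >= threshold
]
print(matches)
```

Like the loops above, this still compares every pair of rows, so it scales quadratically with the number of rows.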