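I am trying to create a new column in a pandas DataFrame based on a string similarity ratio. The difficulty is that the similarity ratio has to be computed between the values stored in a single column. Suppose the DataFrame is the one built in the code below.
This is what I have tried: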
import pandas as pd
import difflib
from functools import partial
test = {
    'TaskBarcode': {
        618: 'TRFX90086BSE',
        622: 'TRFX9008DUDJ',
        624: 'TRFX9008DYFN',
        625: 'TRFX9008PXLC',
        628: 'TRFX9008GKQ5',
        633: 'TRFX9008DY91',
        637: 'TRFX9008F13V',
        638: 'TRFX9008H9TK',
        639: 'TRFX9008DGPT',
        641: 'TRFX9008D1NJ',
    },
    'STSK_NAME': {
        618: '60046100 kick strip missing 10HJK',
        622: 'Dwars #motor 1 in Fancowl doors Kluh',
        624: 'Cabin/under floor dirty/clean',
        625: 'COVER MISSING ON ECONOMY CLASS SEAT FOODTRAY.',
        628: '10123341 lh rwy t/o light',
        633: 'Cabine/wet blankets/remove/dry/install',
        637: 'Ident emergency Exit',
        638: 'CABIN / G2 / INSERT MISSING / AIRCHILLER COMPARTMENT / REPLACE',
        639: 'Cabin/seats/outlet box loose on position 3F.',
        641: 'Seat indication placard of seat 15 ABC damaged.',
    },
}
df_test = pd.DataFrame.from_dict(test)
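def apply_sm(s, c1, c2):
    # s is the row passed in by DataFrame.apply; it is not used here.
    return difflib.SequenceMatcher(None, c1, c2).ratio()

# Note: as written, c1 and c2 are the literal string 'STSK_NAME' rather than
# the column values, so every row ends up with a ratio of 1.0.
df_test['Group'] = df_test.apply(partial(apply_sm, c1='STSK_NAME', c2='STSK_NAME'), axis=1)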
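Basically, I am trying to create a new column in which similar strings (i.e. strings with a high similarity ratio) are grouped together.
Edit: The desired output would look something like this: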
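To make "similarity ratio within a single column" concrete: `difflib.SequenceMatcher(...).ratio()` compares two actual strings, so each value has to be compared against the other values of the same column. The following is only a minimal sketch of that pairwise comparison. The helper name `pairwise_ratios` and the choice to lowercase the strings first are assumptions made for illustration, not part of pandas or difflib.

```python
import difflib
from itertools import combinations


def pairwise_ratios(values, lowercase=True):
    """Return {(i, j): ratio} for every unordered pair of strings in values.

    Hypothetical helper: lowercasing is an assumption so that
    'CABIN' and 'Cabin' are treated as the same text.
    """
    if lowercase:
        values = [v.lower() for v in values]
    return {
        (i, j): difflib.SequenceMatcher(None, values[i], values[j]).ratio()
        for i, j in combinations(range(len(values)), 2)
    }


# Three of the STSK_NAME values from the sample data.
names = [
    'Cabin/under floor dirty/clean',
    'Cabine/wet blankets/remove/dry/install',
    'Ident emergency Exit',
]
ratios = pairwise_ratios(names)
for pair, ratio in sorted(ratios.items()):
    print(pair, round(ratio, 2))
```

Each ratio lies between 0.0 and 1.0, and identical strings give exactly 1.0. How high a ratio has to be before two task names count as "similar" is a separate decision.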
     TaskBarcode   STSK_NAME                                       Group
622  TRFX9008DUDJ  60046100 kick strip missing 10HJK               1
624  TRFX9008DYFN  Dwars motor 1 in Fancowl doors Kluh             2
625  TRFX9008PXLC  Cabin/under floor dirty/clean                   3
628  TRFX9008GKQ5  COVER MISSING ON ECONOMY CLASS SEAT FOODTRAY    4
633  TRFX9008DY91  10123341 lh rwy t/o light                       5
637  TRFX9008F13V  Cabine/wet blankets/remove/dry/install          3
638  TRFX9008H9TK  Ident emergency Exit                            6
639  TRFX9008DGPT  CABIN / G2 / INSERT MISSING / AIRCHILLER        3
641  TRFX9008D1NJ  Seat indication placard of seat 15 ABC damaged  7
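One possible way to produce a `Group` column like this is a greedy pass: each string joins the first existing group whose first member it resembles closely enough, otherwise it starts a new group. This is only a sketch under stated assumptions. The function name `assign_groups`, the default threshold of 0.6, and the lowercasing step are all hypothetical choices, and the result depends on row order. It is not guaranteed to reproduce the exact groups in the table above.

```python
import difflib


def assign_groups(values, threshold=0.6):
    """Greedily assign 1-based group ids to a list of strings.

    A string joins the first group whose representative (the first
    string that opened the group) matches it with ratio >= threshold;
    otherwise it opens a new group. The threshold is an assumed value
    that would need tuning on real data.
    """
    representatives = []  # lowercased first member of each group
    groups = []
    for value in values:
        text = value.lower()
        for group_id, rep in enumerate(representatives, start=1):
            if difflib.SequenceMatcher(None, rep, text).ratio() >= threshold:
                groups.append(group_id)
                break
        else:
            # No existing group was similar enough: open a new one.
            representatives.append(text)
            groups.append(len(representatives))
    return groups


# Small illustration with made-up strings.
print(assign_groups(['abc', 'ABC', 'xyz']))
```

With the DataFrame from the question this could be applied as `df_test['Group'] = assign_groups(df_test['STSK_NAME'].tolist())`. Because `SequenceMatcher` works on raw characters, task names that share only a leading word such as "Cabin" may still fall below the threshold, so lowering the threshold or comparing only a prefix may be needed.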