我开始使用熊猫,遇到了一个我不知道如何解决的问题。
我有两个数据框。
第一个包含汽车信息,包括汽车型号(列DESCR_MARCA_VEICULO)
df1
col1 col2 DESCR_MARCA_VEICULO
.... .... 'GM/CELTA 5 PORTAS SUPER'
.... .... 'VW/VOYAGE LS'
.... .... 'VW/GOL LS'
.... .... 'I/AUDI A4 2.0T FSI'
.... .... 'FIAT/UNO CS IE'
第二个包含一个两列的de-para,其中包含汽车模型和与该模型关联的唯一ID,例如:
df2
ID DESCR_MARCA_VEICULO
1 'GM - CELTA 5'
2 'VW - VOYAGE LS'
3 'VW - GOL LS'
4 'ACURA - INTEGRA GS 1.8'
5 'AUDI - 80 S2 AVANT'
它并不一定要遵循将“ /”替换为“-”之类的模式。
但是,我在DF1中有5000多种不同型号的汽车(这使我无法逐案查看),并且我需要将DF1和DF2结合起来,以将ID列添加到DF1中(这将是合并)。但是,当合并dfs时,由于字符串中的这些差异而导致不匹配。
有什么办法可以通过DESCR_MARCA_VEICULO列中的字符串之间的相似性来合并这些df?
谢谢:)
答案 0 :(得分:0)
我建议您使用Fuzzywuzzy软件包研究模糊匹配
一种技巧是从字符串中删除所有字符,字母和数字除外
df1['var'] = df1['DESCR_MARCA_VEICULO'].str.replace('[^A-Za-z0-9]','')
df2['var'] = df2['DESCR_MARCA_VEICULO'].str.replace('[^A-Za-z0-9]','')
m = pd.merge(df1,df2,on='var').drop('var',axis=1)
print(m)
col1 col2 DESCR_MARCA_VEICULO_x matches ID DESCR_MARCA_VEICULO_y
0 .... .... VW/VOYAGE LS VWVOYAGELS 2 VW - VOYAGE LS
1 .... .... VW/GOL LS VWGOLLS 3 VW - GOL LS
我们需要匹配几个缺少的键:让我们使用post的Erfan答案
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
"""
df_1 is the left table to join
df_2 is the right table to join
key1 is the key column of the left table
key2 is the key column of the right table
threshold is how close the matches should be to return a match, based on Levenshtein distance
limit is the amount of matches that will get returned, these are sorted high to low
"""
s = df_2[key2].tolist()
m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
df_1['matches'] = m
m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
df_1['matches'] = m2
return df_1
print(fuzzy_merge(df1, df2, 'var', 'var', threshold=80))
col1 col2 DESCR_MARCA_VEICULO var matches
0 .... .... GM/CELTA 5 PORTAS SUPER GMCELTA5PORTASSUPER GMCELTA5
1 .... .... VW/VOYAGE LS VWVOYAGELS VWVOYAGELS
2 .... .... VW/GOL LS VWGOLLS VWGOLLS
3 .... .... I/AUDI A4 2.0T FSI IAUDIA420TFSI
4 .... .... FIAT/UNO CS IE FIATUNOCSIE