Fuzzy merge and closest-row matching in pandas

Time: 2019-08-27 23:56:45

Tags: python pandas nlp

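I have two DataFrames. Each row contains one word. The two are very similar, but there are spelling differences, and sometimes one df has one or two words that the other lacks.

In general, I want to combine df2.word with df1.metadata. If df2.word and df1.word match exactly, or are spelled closely enough, and are within 1 row of each other, I want to join df2.word with df1.metadata. If there is neither a direct match nor a match within 1 row, I want to drop the row.

I have: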

df1

word      metadata    metadata2

okay        1           A
I           1           A
win         1           A
tree        1           A
apples      1           A
also        0           B
would       0           B
like        0           B
for         0           B 
oranges     0           B

df2

word

OK.         
I          
want        
three       
apples.     
Also,        
I           
would       
like          
four        
oranges.    


What I want is:

word      metadata    metadata2

OK.         1           B
I           1           B
want        1           B
three       1           B
apples.     1           B
Also,       0           B       
would       0           B
like        0           B
four        0           B
oranges.    0           B

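1 Answer:

Answer 0 (score: 1)

Fuzzy matching is an expensive process, and its cost grows with the amount of data you have, so I believe you should use concurrency for this. I also think 100% accuracy is very hard to reach, so you will have to settle for an approach along these lines: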

import pandas as pd
from fuzzywuzzy import process
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


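def get_match(word):
    # df1 and df2 are assumed to be defined at module level.
    # With a Series as choices, extractOne returns (match, score, index label).
    match, score, _ = process.extractOne(word, df1['word'])
    if score > 50:
        # Take the metadata of the first df1 row whose word equals the match.
        s = df1.loc[df1['word'].eq(match), 'metadata'].iloc[0]
        return [word, s]
    return None  # below the score cutoff; filtered out in main()

def main():
    # Swap ThreadPoolExecutor for ProcessPoolExecutor to switch from
    # multithreading to multiprocessing.
    with ThreadPoolExecutor() as executor:
        results = executor.map(get_match, df2['word'])
        return [r for r in results if r]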

if __name__ == '__main__':
    df = pd.DataFrame(main(), columns=['word', 'metadata'])
    print(df)

        word  metadata
0        OK.         1
1          I         1
2       want         1
3      three         1
4    apples.         1
5      Also,         0
6          I         1
7      would         0
8       like         0
9       four         0
10  oranges.         0
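
Note that the output above still contains the second "I" (row 6), because every word is compared against all of df1 regardless of position. The question also asked that matches lie within 1 row of each other. A minimal sketch of that constraint follows, under some assumptions of mine: it uses difflib.SequenceMatcher from the standard library in place of fuzzywuzzy, plain lists in place of the DataFrames, a similarity cutoff of 0.5 picked by hand for this sample, and a hypothetical helper name match_within_window. Positions are compared by raw row index, so it will not follow an offset that drifts by more than one row.

```python
import string
from difflib import SequenceMatcher

# Sample data copied from the question: (word, metadata) pairs for df1.
df1_rows = [("okay", 1), ("I", 1), ("win", 1), ("tree", 1), ("apples", 1),
            ("also", 0), ("would", 0), ("like", 0), ("for", 0), ("oranges", 0)]
df2_words = ["OK.", "I", "want", "three", "apples.", "Also,",
             "I", "would", "like", "four", "oranges."]


def normalize(word):
    # Lowercase and strip surrounding punctuation so "OK." compares as "ok".
    return word.lower().strip(string.punctuation)


def match_within_window(words, rows, window=1, threshold=0.5):
    """Keep each word whose best fuzzy match lies within `window` rows.

    `threshold` is an assumed cutoff on SequenceMatcher.ratio() (0.0 to 1.0).
    """
    matched = []
    for i, word in enumerate(words):
        best_score, best_meta = 0.0, None
        # Only rows i - window .. i + window of df1 are candidates.
        lo, hi = max(0, i - window), min(len(rows), i + window + 1)
        for candidate, meta in rows[lo:hi]:
            score = SequenceMatcher(
                None, normalize(word), normalize(candidate)
            ).ratio()
            if score > best_score:
                best_score, best_meta = score, meta
        # Words with no close enough neighbour are dropped.
        if best_score >= threshold:
            matched.append((word, best_meta))
    return matched


result = match_within_window(df2_words, df1_rows)
for word, meta in result:
    print(word, meta)
```

On this sample the second "I" is dropped, since none of its neighbouring df1 rows ("also", "would", "like") is similar enough. The list of pairs can then be passed to pd.DataFrame(result, columns=['word', 'metadata']) to get back to a DataFrame.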