Question

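I have two DataFrames, each with one word per row. They are very similar, but there are spelling differences, and sometimes one DataFrame has a word or two that the other lacks.

In general, I want to combine df2.word with df1.metadata. If df2.word and df1.word match exactly, or are spelled closely enough, and are within 1 row of each other, I want to join df2.word to the df1 metadata. If there is no direct match, or no match within 1 row, I want to drop the row.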

I have:

df1

word      metadata    metadata2

okay        1           A
I           1           A
win         1           A
tree        1           A
apples      1           A
also        0           B
would       0           B
like        0           B
for         0           B 
oranges     0           B

df2

word

OK.         
I          
want        
three       
apples.     
Also,        
I           
would       
like          
four        
oranges.    


What I want is:

word      metadata    metadata2

OK.         1           A
I           1           A
want        1           A
three       1           A
apples.     1           A
Also,       0           B
would       0           B
like        0           B
four        0           B
oranges.    0           B


Answer 1

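Fuzzy matching is expensive, and the cost grows with the amount of data you have, so I believe you should use concurrency here. I also think 100% accuracy is very hard to achieve, so you will have to settle for the following approximation: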

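import pandas as pd
from fuzzywuzzy import process
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Sample frames from the question, defined at module level so get_match can see them.
df1 = pd.DataFrame({
    'word': ['okay', 'I', 'win', 'tree', 'apples',
             'also', 'would', 'like', 'for', 'oranges'],
    'metadata': [1] * 5 + [0] * 5,
    'metadata2': ['A'] * 5 + ['B'] * 5,
})
df2 = pd.DataFrame({'word': ['OK.', 'I', 'want', 'three', 'apples.', 'Also,',
                             'I', 'would', 'like', 'four', 'oranges.']})


def get_match(word):
    # Best fuzzy match for `word` across all of df1['word'] (no row-distance check).
    match, score, _ = process.extractOne(word, df1['word'])
    if score > 50:
        # Metadata of the first df1 row whose word equals the best match.
        s = df1.loc[df1['word'].eq(match), 'metadata'].iloc[0]
        return [word, s]
    return None  # below the threshold; filtered out in main()


def main():
    # Swap ThreadPoolExecutor for ProcessPoolExecutor to switch from
    # multithreading to multiprocessing.
    with ThreadPoolExecutor() as executor:
        results = executor.map(get_match, df2['word'])
        return [r for r in results if r]


if __name__ == '__main__':
    df = pd.DataFrame(main(), columns=['word', 'metadata'])
    print(df)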

        word  metadata
0        OK.         1
1          I         1
2       want         1
3      three         1
4    apples.         1
5      Also,         0
6          I         1
7      would         0
8       like         0
9       four         0
10  oranges.         0
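
Note that the output above still contains the second `I` (row 6), because the match is searched across all of df1 rather than within 1 row. If the row-distance rule from the question matters, one possible approach is to compare each word only against the df1 rows at nearby positions. The following is a minimal sketch under several assumptions that are not part of the answer above: it uses the standard library's `difflib.SequenceMatcher` instead of fuzzywuzzy, the helper names (`normalize`, `similarity`, `windowed_match`) and the 0.5 threshold are illustrative choices, and it works on plain lists so it runs without pandas.

```python
from difflib import SequenceMatcher
import string


def normalize(word):
    """Lowercase and strip surrounding punctuation, so 'Also,' compares as 'also'."""
    return word.lower().strip(string.punctuation)


def similarity(a, b):
    """Similarity in [0, 1] between two words after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def windowed_match(words2, rows1, window=1, threshold=0.5):
    """Pair each word in words2 with the most similar row of rows1 whose
    position is within `window` rows; drop words with no candidate at or
    above `threshold`. rows1 holds (word, metadata, metadata2) tuples."""
    matched = []
    for i, word in enumerate(words2):
        lo, hi = max(0, i - window), min(len(rows1), i + window + 1)
        candidates = rows1[lo:hi]
        if not candidates:
            continue  # position lies beyond the end of rows1
        best = max(candidates, key=lambda row: similarity(word, row[0]))
        if similarity(word, best[0]) >= threshold:
            matched.append((word,) + tuple(best[1:]))
    return matched


# Sample data from the question.
rows1 = [('okay', 1, 'A'), ('I', 1, 'A'), ('win', 1, 'A'), ('tree', 1, 'A'),
         ('apples', 1, 'A'), ('also', 0, 'B'), ('would', 0, 'B'),
         ('like', 0, 'B'), ('for', 0, 'B'), ('oranges', 0, 'B')]
words2 = ['OK.', 'I', 'want', 'three', 'apples.', 'Also,',
          'I', 'would', 'like', 'four', 'oranges.']

result = windowed_match(words2, rows1)
for row in result:
    print(row)
# 10 rows are kept; the second 'I' is dropped because none of its
# neighbours in rows1 ('also', 'would', 'like') is similar enough.
```

With DataFrames, the inputs could be built with `list(df1.itertuples(index=False, name=None))` and `df2['word'].tolist()`, and the result wrapped with `pd.DataFrame(result, columns=['word', 'metadata', 'metadata2'])`. The window here is measured on absolute row positions, so it only tolerates the two frames drifting apart by one row in total; longer inputs with many insertions would need an alignment step that tracks the running offset.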
