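I have two DataFrames. Each row contains one word. They are very similar, but there are spelling errors, and sometimes one df has a word or two that the other does not.
In general, I want to join df2.word with df1.metadata. If df2.word and df1.word match, or are spelled closely enough, and are within 1 row of each other, I want to join df2.word with df1.metadata. If there is no direct match, or no match within 1 row, I want to drop the row.
I have: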
df1
word     metadata  metadata2
okay     1         A
I        1         A
win      1         A
tree     1         A
apples   1         A
also     0         B
would    0         B
like     0         B
for      0         B
oranges  0         B
df2
word
OK.
I
want
three
apples.
Also,
I
would
like
four
oranges.
What I want is:
word      metadata  metadata2
OK.       1         B
I         1         B
want      1         B
three     1         B
apples.   1         B
Also,     0         B
would     0         B
like      0         B
four      0         B
oranges.  0         B
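Answer 0 (score: 1)
Because fuzzy matching is an expensive process, especially as it scales with the amount of data you have, I believe you should use concurrency for this. Also, I think 100% accuracy will be very hard to reach, so you will have to settle for an approximation like the following: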
import pandas as pd
from fuzzywuzzy import process
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def get_match(word):
    # df1 and df2 are assumed to be defined at module level.
    # With a Series as choices, extractOne returns (match, score, key).
    match, score, _ = process.extractOne(word, df1['word'])
    if score > 50:
        # Take the metadata of the first df1 row whose word equals the match.
        s = df1.loc[df1['word'].eq(match), 'metadata'].iloc[0]
        return [word, s]
    return None  # below the threshold; filtered out in main()


def main():
    # Swap ThreadPoolExecutor for ProcessPoolExecutor to switch from
    # multithreading to multiprocessing.
    with ThreadPoolExecutor() as executor:
        results = executor.map(get_match, df2['word'])
    return (r for r in results if r)


if __name__ == '__main__':
    df = pd.DataFrame(main(), columns=['word', 'metadata'])
    print(df)
Output:
        word  metadata
0        OK.         1
1          I         1
2       want         1
3      three         1
4    apples.         1
5      Also,         0
6          I         1
7      would         0
8       like         0
9       four         0
10  oranges.         0
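Note that the output above still contains the second "I" (row 6), because every word is compared against all of df1 and row position is ignored. If the "within 1 row" condition from the question matters, one option is to compare each df2 word only with the df1 rows at positions i-1, i and i+1. The following is only a rough sketch under some assumptions: it uses the standard library's difflib.SequenceMatcher in place of fuzzywuzzy, works on plain lists rather than DataFrames, and the names normalize, similarity and match_within_window as well as the 0.5 threshold are made up for illustration.

```python
from difflib import SequenceMatcher
import string


def normalize(word):
    # Lowercase and strip surrounding punctuation, so "OK." compares as "ok".
    return word.lower().strip(string.punctuation)


def similarity(a, b):
    # Similarity ratio in [0, 1] between the normalized words.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_within_window(words2, rows1, window=1, threshold=0.5):
    """Pair each word in words2 with the most similar row of rows1 that lies
    within `window` positions of it; drop words with no match >= threshold.

    rows1 is a list of (word, metadata, metadata2) tuples.
    """
    matched = []
    for i, word in enumerate(words2):
        lo = max(0, i - window)
        hi = min(len(rows1), i + window + 1)
        best_row, best_score = None, 0.0
        for j in range(lo, hi):
            score = similarity(word, rows1[j][0])
            if score > best_score:
                best_row, best_score = rows1[j], score
        if best_row is not None and best_score >= threshold:
            matched.append((word, best_row[1], best_row[2]))
    return matched


# Sample data from the question.
rows1 = [("okay", 1, "A"), ("I", 1, "A"), ("win", 1, "A"), ("tree", 1, "A"),
         ("apples", 1, "A"), ("also", 0, "B"), ("would", 0, "B"),
         ("like", 0, "B"), ("for", 0, "B"), ("oranges", 0, "B")]
words2 = ["OK.", "I", "want", "three", "apples.", "Also,", "I",
          "would", "like", "four", "oranges."]

matched = match_within_window(words2, rows1)
for row in matched:
    print(row)  # the second "I" has no close neighbour in df1 and is dropped
```

A fixed positional window like this only holds up while the two lists stay roughly aligned. Once several extra or missing words accumulate, the true counterpart drifts outside the window, so longer inputs would probably need a proper sequence alignment instead.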