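I am trying to match approximate movie titles in one data frame to the closest actual movie titles in another data frame. This is the first data frame: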
Old ID New ID Movie1 Movie2
3101 771355141 this. This is Dogma Shaun of the Dead.
11903 18330 tale mise en scene giallo thriller posing as
16077 771225176 Evil Dead II Brothers Grimm and Hawkeye
NaN 381422014 'Requiem for a Dream' 'After Dark." If only
4540 770676801 Ocean's Eleven Saved By The Bell.
9103 770673272 It's Godzilla The Blair Witch Project,
2473 49248746 day classic. Die Hard Fellini? No. But maybe.
This is the second data frame:
Old id New id Title
NaN 21736.0 Peter Pan
NaN 771471359.0 Dragonheart Battle for the Heartfire
NaN 770725090.0 The Nude Vampire Vampire nue, La
2281.0 19887.0 Beyond the Clouds
10913.0 11286.0 Wild America
NaN 17635.0 Sexual Dependency
NaN 666370586.0 Body Slam
709.0 771203994.0 Hatchet II
NaN 11655.0 Lion of the Desert Omar Mukhtar
15492.0 770681831.0 Imagine That
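I am trying to match the approximate Movie1 and Movie2 titles in df1 to the actual titles in df2. What I have done so far is define a function that returns the closest movie title, based on the string similarity ratio computed by difflib.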
from difflib import SequenceMatcher as SM

def find_closest(approx_title, movies):
    """Return the title in movies that is closest to the approximate title based on
    difflib distance
    """
    return max(movies, key=lambda title: SM(None, approx_title, title).ratio())
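To illustrate what the function does, here is a minimal sketch on a handful of titles taken from the second data frame. The query string "Hatchet 2" is a made-up example, not a value from my data. Note that max always returns some title, even when nothing in the list is a good match.

```python
from difflib import SequenceMatcher as SM

def find_closest(approx_title, movies):
    """Return the title in movies with the highest difflib similarity ratio."""
    return max(movies, key=lambda title: SM(None, approx_title, title).ratio())

# A few titles from the second data frame.
titles = ["Peter Pan", "Wild America", "Hatchet II", "Body Slam"]

# "Hatchet 2" is a hypothetical approximate title used only for illustration.
closest = find_closest("Hatchet 2", titles)
print(closest)
```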
My idea is to apply this function to every element of Movie1 and Movie2 and save the results as new columns.
movie_lst = rt_info['Title'].tolist()
movie_lst = [str(x) for x in movie_lst]
movie1_match = meets_df['Movie1'].apply(
    lambda movie: find_closest(movie, movie_lst))
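However, this function runs very slowly (movie_lst contains 24,883 movies). I suspect that applying the function through a lambda, combined with the lambda inside max, is inefficient. How can I optimize this process? Thanks!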
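For reference, one possible direction, offered as a sketch and not as a measured fix for this data set: the cost is probably dominated by the full SequenceMatcher.ratio() call made for every (query, title) pair, not by the lambdas themselves. difflib documents real_quick_ratio() and quick_ratio() as cheap upper bounds on ratio(), so a candidate whose bound cannot beat the best score found so far can be skipped. The function name find_closest_pruned and the sample queries below are hypothetical.

```python
from difflib import SequenceMatcher

def find_closest_pruned(approx_title, movies):
    """Return the first title with the highest ratio(), skipping hopeless candidates.

    Returns None for an empty list (the max-based version raises ValueError).
    """
    best_title, best_ratio = None, -1.0
    matcher = SequenceMatcher(None)
    matcher.set_seq1(approx_title)  # same argument order as SM(None, approx_title, title)
    for title in movies:
        matcher.set_seq2(title)
        # Both quick ratios are upper bounds on ratio(), so a candidate
        # whose bound does not exceed the current best cannot win.
        if matcher.real_quick_ratio() <= best_ratio:
            continue
        if matcher.quick_ratio() <= best_ratio:
            continue
        ratio = matcher.ratio()
        if ratio > best_ratio:
            best_title, best_ratio = title, ratio
    return best_title

# Titles from the second data frame; the queries are made-up examples.
titles = ["Peter Pan", "Wild America", "Hatchet II", "Body Slam"]
queries = ["Hatchet 2", "Peter Pan.", "wild america", "Hatchet 2"]

# Compute each distinct query only once, then look results up from the cache.
cache = {q: find_closest_pruned(q, titles) for q in set(queries)}
matches = [cache[q] for q in queries]
print(matches)
```

With pandas, the same caching idea would amount to computing matches for the unique values of a column and mapping them back onto it. How much time the pruning saves depends on how early a strong match shows up in the list, so it would need to be timed on the real 24,883 titles.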