How to find the closest string match for a pandas DataFrame column in a column of another DataFrame

Time: 2017-12-21 22:26:37

Tags: pandas apply fuzzy-comparison

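I am trying to match approximate movie titles in one DataFrame to the closest actual movie titles in another DataFrame. This is the first DataFrame: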

Old ID  New ID      Movie1                  Movie2
3101    771355141   this. This is Dogma     Shaun of the Dead.
11903   18330       tale mise en scene      giallo thriller posing as
16077   771225176   Evil Dead II            Brothers Grimm and Hawkeye
NaN     381422014   'Requiem for a Dream'   'After Dark." If only
4540    770676801   Ocean's Eleven          Saved By The Bell.
9103    770673272   It's Godzilla           The Blair Witch Project,
2473    49248746    day classic. Die Hard   Fellini? No. But maybe.

This is the second DataFrame:

Old id   New id       Title
NaN      21736.0      Peter Pan
NaN      771471359.0  Dragonheart Battle for the Heartfire
NaN      770725090.0  The Nude Vampire Vampire nue, La
2281.0   19887.0      Beyond the Clouds
10913.0  11286.0      Wild America
NaN      17635.0      Sexual Dependency
NaN      666370586.0  Body Slam
709.0    771203994.0  Hatchet II
NaN      11655.0      Lion of the Desert Omar Mukhtar
15492.0  770681831.0  Imagine That

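I am trying to match the approximate Movie1 and Movie2 titles in df1 to the actual titles in df2. So far I have defined a function that returns the closest movie title, based on the string similarity computed by difflib.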

from difflib import SequenceMatcher as SM

def find_closest(approx_title, movies):
    """Return the title in movies that is closest to the approximate title,
    based on the difflib similarity ratio.
    """
    return max(movies, key=lambda title: SM(None, approx_title, title).ratio())

My idea is to apply this function to every element of Movie1 and Movie2 and save the result as a new column.

movie_lst = rt_info['Title'].tolist()
movie_lst = [str(x) for x in movie_lst]
movie1_match = meets_df['Movie1'].apply(lambda movie: 
    find_closest(movie, movie_lst))

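However, this runs very slowly (movie_lst contains 24883 movies). I suspect that applying the function through a lambda may be inefficient, especially combined with the lambda inside max. How can I optimize this process? Thanks!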
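For reference, one possible direction is sketched below. It is not a measured fix. It relies on two standard-library pieces: difflib.get_close_matches, which skips the full ratio() computation for candidates that its cheaper upper-bound checks already rule out, and functools.lru_cache, which avoids recomputing a match for an approximate title that repeats. The make_matcher helper, the cutoff values, and the toy titles are assumptions for illustration only. Unlike find_closest, this version returns None when no candidate reaches the cutoff, and its scores may differ slightly because the two strings are compared in the opposite order.

```python
from difflib import get_close_matches
from functools import lru_cache


def make_matcher(movies, cutoff=0.6):
    """Build a cached lookup function over a fixed list of titles (hypothetical helper)."""
    movies = tuple(movies)

    @lru_cache(maxsize=None)
    def find_closest_cached(approx_title):
        # n=1 keeps only the best candidate; candidates scoring below
        # cutoff are discarded, so the result can be None.
        matches = get_close_matches(approx_title, movies, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    return find_closest_cached


# Toy data for illustration only, not the real 24883 titles.
movie_lst = ["Peter Pan", "Dogma", "Evil Dead II", "Die Hard"]
approx_titles = [
    "this. This is Dogma",
    "Evil Dead II",
    "day classic. Die Hard",
    "this. This is Dogma",  # repeated value, served from the cache
]

find_closest_cached = make_matcher(movie_lst, cutoff=0.3)
matches = [find_closest_cached(title) for title in approx_titles]
print(matches)
```

With pandas, the same function could be passed directly to apply, for example meets_df['Movie1'].apply(find_closest_cached). How much this helps depends on the chosen cutoff and on how often approximate titles repeat, so it would need to be timed on the real data.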

0 answers:

No answers