I have two dataframes:
(1st Dataframe)
**Sentences**
hello world
live in the world
haystack in the needle
(2nd Dataframe in descending order by Weight)
**Words** **Weight**
world 80
hello 60
haystack 40
needle 20
I want to check each sentence in the first dataframe: if any word in the sentence matches a word listed in the second dataframe, pick the matching word with the highest weight. Then I will assign that highest-weight word to the sentence in the first dataframe. So the result should be:
**Sentence** **Assigned Word**
hello world world
live in the world world
haystack in the needle haystack
I thought about using two for loops, but if there are millions of sentences or words, the performance could be very slow. What is the best way to do this in Python? Thanks!
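For reference, the two-loop approach I had in mind looks roughly like this (just a sketch, assuming the two tables are loaded as pandas DataFrames df1 with a Sentences column and df2 with Words/Weight columns, the same names the answer below uses):

import pandas as pd

# Naive baseline: for each sentence, walk the word list from the heaviest
# weight down and stop at the first word that appears in the sentence.
def assign_words(df1, df2):
    df2_sorted = df2.sort_values('Weight', ascending=False)
    assigned = []
    for sentence in df1['Sentences']:
        tokens = set(sentence.split())
        match = None
        for word in df2_sorted['Words']:
            if word in tokens:
                match = word
                break
        assigned.append(match)
    result = df1.copy()
    result['Assigned Word'] = assigned
    return result

This is O(sentences × words) in the worst case, which is why I am worried about scale.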
Answer 0 (score: 0)
groupby.head(1)
This approach involves several steps, but it is the most pandas-like way I can think of.
import pandas as pd
import numpy as np
list1 = ['hello world',
'live in the world',
'haystack in the needle']
list2 = [['world',80],
['hello',60],
['haystack',40],
['needle',20]]
df1 = pd.DataFrame(list1,columns=['Sentences'])
df2 = pd.DataFrame(list2,columns=['Words','Weight'])
# Creating a new column `Word_List`
df1['Word_List'] = df1['Sentences'].apply(lambda x : x.split(' '))
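# (Note: building a set here instead of a list, e.g. set(x.split(' ')),
#  would make the membership check in the Match step below O(1) per word.)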
# Need a common key for cartesian product
df1['common_key'] = 1
df2['common_key'] = 1
# Cartesian Product
df3 = pd.merge(df1,df2,on='common_key',copy=False)
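# (Note: on pandas >= 1.2 the same cross join can be written as
#  pd.merge(df1, df2, how='cross'), without the dummy common_key column.)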
# Flag rows where the word occurs in the sentence, then keep only those rows
df3['Match'] = df3.apply(lambda x : x['Words'] in x['Word_List'], axis=1)
df3 = df3[df3['Match']]
# Sorting by sentence and weight in descending order, so the heaviest match comes first within each sentence
df3.sort_values(['Sentences','Weight'],axis=0,inplace=True,ascending=False)
# Keeping only the first element in each group
final_df = df3.groupby('Sentences').head(1).reset_index()[['Sentences','Words']]
final_df
Output:
Sentences Words
0 live in the world world
1 hello world world
2 haystack in the needle haystack
Performance:
10 loops, best of 3: 41.5 ms per loop
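One caveat on scale: the cartesian product materializes one row per (sentence, word) pair before filtering, so with millions of sentences it can run out of memory before it runs out of time. As a rough alternative sketch (my own variation, not part of the answer above, reusing the same df1/df2 column names), a plain word-to-weight dictionary avoids building that product:

# Build a word -> weight lookup once, then score each sentence directly.
weights = dict(zip(df2['Words'], df2['Weight']))

def best_word(sentence):
    # Keep only the tokens that have a weight; return the heaviest one, else None.
    candidates = [w for w in sentence.split() if w in weights]
    return max(candidates, key=weights.get) if candidates else None

df1['Assigned Word'] = df1['Sentences'].apply(best_word)

This does one dictionary lookup per token, so the cost grows with the total number of tokens rather than with sentences × words.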