Python pandas DataFrame with list columns: get the intersection and apply a function to another column

Asked: 2015-11-11 20:39:47

Tags: python list pandas dataframe intersection

Problem data

import pandas as pd

df = pd.DataFrame({'Keyword': ['basement finishing systems akron pa', 'basement finishing systems biglerville pa', 'basement finishing systems chambersburg pa', 'basement finishing systems christiana pa', 'basement finishing systems delta pa'], 'StemmedKW': [['basement', 'finish', 'system', 'akron', 'pa'], ['basement', 'finish', 'system', 'biglervil', 'pa'], ['basement', 'finish', 'system', 'chambersburg', 'pa'], ['basement', 'finish', 'system', 'christiana', 'pa'], ['basement', 'finish', 'system', 'delta', 'pa']], 'Ad Group': ['Finishing System', 'Finishing System', 'Finishing System', 'Finishing System', 'Finishing System'], 'Campaign': ['Campaign A', 'Campaign A', 'Campaign A', 'Campaign A', 'Campaign A'], 'StemmedAG': [['finish', 'system'], ['finish', 'system'], ['finish', 'system'], ['finish', 'system'], ['finish', 'system']]}, columns=['Campaign', 'Ad Group', 'Keyword', 'StemmedAG', 'StemmedKW'])

The DataFrame looks like this:

     Campaign          Ad Group                                     Keyword  \
0  Campaign A  Finishing System         basement finishing systems akron pa   
1  Campaign A  Finishing System   basement finishing systems biglerville pa   
2  Campaign A  Finishing System  basement finishing systems chambersburg pa   
3  Campaign A  Finishing System    basement finishing systems christiana pa   
4  Campaign A  Finishing System         basement finishing systems delta pa   

          StemmedAG                                     StemmedKW  
0  [finish, system]         [basement, finish, system, akron, pa]  
1  [finish, system]     [basement, finish, system, biglervil, pa]  
2  [finish, system]  [basement, finish, system, chambersburg, pa]  
3  [finish, system]    [basement, finish, system, christiana, pa]  
4  [finish, system]         [basement, finish, system, delta, pa] 

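Context

StemmedAG and StemmedKW are list columns. I generated them by stemming the Ad Group and Keyword columns. The goal is to put a plus sign (+) in front of every word in the Keyword column whose stem appears in both StemmedAG and StemmedKW.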

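Result

Notice that the Keyword value in row 0 is now basement +finishing +systems akron pa. That is because the stems finish and system both appear in StemmedAG and StemmedKW, so the plus sign is placed before the corresponding unstemmed words in the Keyword column.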

     Campaign          Ad Group                                       Keyword  \
0  Campaign A  Finishing System         basement +finishing +systems akron pa   
1  Campaign A  Finishing System   basement +finishing +systems biglerville pa   
2  Campaign A  Finishing System  basement +finishing +systems chambersburg pa   
3  Campaign A  Finishing System    basement +finishing +systems christiana pa   
4  Campaign A  Finishing System         basement +finishing +systems delta pa   

              StemmedAG                                          StemmedKW  
0  ['finish', 'system']    ['basement', 'finish', 'system', 'akron', 'pa']  
1  ['finish', 'system']  ['basement', 'finish', 'system', 'biglervil', ...  
2  ['finish', 'system']  ['basement', 'finish', 'system', 'chambersburg...  
3  ['finish', 'system']  ['basement', 'finish', 'system', 'christiana',...  
4  ['finish', 'system']    ['basement', 'finish', 'system', 'delta', 'pa'] 
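
To make the rule concrete, here is a minimal sketch for row 0 only, in plain Python. It assumes that the i-th entry of StemmedKW is the stem of the i-th word of Keyword, which holds for the sample data but is not stated explicitly in the question. The variable names are illustrative.

```python
# Row 0 of the example data, written out by hand.
keyword = "basement finishing systems akron pa"
stemmed_kw = ["basement", "finish", "system", "akron", "pa"]
stemmed_ag = ["finish", "system"]

# Stems that occur in both lists.
shared = set(stemmed_ag) & set(stemmed_kw)

# Assumption: stemmed_kw[i] is the stem of the i-th word of keyword.
marked = " ".join(
    "+" + word if stem in shared else word
    for word, stem in zip(keyword.split(), stemmed_kw)
)
print(marked)  # basement +finishing +systems akron pa
```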

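I'm not used to working with lists in pandas columns, and I don't know how to get the intersection of two list columns in a DataFrame, then find the index of each position where a shared word occurs, and then put a plus sign in front of the word at each of those indices. Or would it be simpler to do a string replace on df['Keyword'] using the words in StemmedAG?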
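
For the intersection-and-positions part on its own, one possible sketch is a small helper applied to the two list columns pair by pair. The helper name `shared_positions` and the column name `SharedPos` are made up for illustration, and only three of the five rows are retyped so the block runs by itself.

```python
import pandas as pd

# Three rows of the question's list columns, retyped so the sketch runs alone.
df = pd.DataFrame({
    "StemmedAG": [["finish", "system"]] * 3,
    "StemmedKW": [
        ["basement", "finish", "system", "akron", "pa"],
        ["basement", "finish", "system", "biglervil", "pa"],
        ["basement", "finish", "system", "delta", "pa"],
    ],
})

def shared_positions(ag_stems, kw_stems):
    """Return the indices in kw_stems whose stem also occurs in ag_stems."""
    ag = set(ag_stems)
    return [i for i, stem in enumerate(kw_stems) if stem in ag]

# "SharedPos" is a hypothetical column name, not part of the original data.
df["SharedPos"] = [
    shared_positions(ag, kw) for ag, kw in zip(df["StemmedAG"], df["StemmedKW"])
]
print(df["SharedPos"].tolist())  # [[1, 2], [1, 2], [1, 2]]
```

Walking StemmedKW in order, rather than converting both lists to sets first, keeps the positions and also handles a stem that occurs more than once.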

I'd also like to do as much of this as possible in pandas and avoid for loops.

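1 Answer:

Answer 0 (score: 0)

I figured out how to do this with a non-pandas approach, but it's pretty nasty. I'd really like to learn how to do it with pandas (if that's even possible!)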

for idx in df.index:
    # Stems shared by the ad group and the keyword in this row
    shared = set(df.at[idx, 'StemmedAG']) & set(df.at[idx, 'StemmedKW'])
    # StemmedKW is assumed to line up position by position with the words of Keyword
    stems = df.at[idx, 'StemmedKW']
    words = df.at[idx, 'Keyword'].split()
    df.at[idx, 'Keyword'] = ' '.join('+' + w if s in shared else w for w, s in zip(words, stems))
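
The same logic can also be moved into a function and run with `DataFrame.apply(axis=1)`, which drops the explicit loop from the calling code. This is only a sketch: the function name `add_plus` is made up, two rows are retyped so the block is self-contained, and it again assumes StemmedKW lines up word for word with Keyword.

```python
import pandas as pd

df = pd.DataFrame({
    "Keyword": [
        "basement finishing systems akron pa",
        "basement finishing systems delta pa",
    ],
    "StemmedAG": [["finish", "system"], ["finish", "system"]],
    "StemmedKW": [
        ["basement", "finish", "system", "akron", "pa"],
        ["basement", "finish", "system", "delta", "pa"],
    ],
})

def add_plus(row):
    """Prefix '+' to each Keyword word whose stem also occurs in StemmedAG."""
    shared = set(row["StemmedAG"]) & set(row["StemmedKW"])
    # Assumption: StemmedKW[i] is the stem of the i-th word of Keyword.
    pairs = zip(row["Keyword"].split(), row["StemmedKW"])
    return " ".join("+" + word if stem in shared else word for word, stem in pairs)

df["Keyword"] = df.apply(add_plus, axis=1)
print(df["Keyword"].tolist())
# ['basement +finishing +systems akron pa', 'basement +finishing +systems delta pa']
```

Note that `apply` with `axis=1` still calls the function once per row in Python, so this is tidier rather than faster; list-valued cells do not get vectorized set operations in pandas.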