I have a pandas DataFrame A with a column keywords, as follows:
keywords
['loans','mercedez','bugatti','a4']
['trump','usa','election','president']
['galaxy','7s','canon','macbook']
['beiber','spiderman','marvels','ironmen']
.........................................
.........................................
.........................................
I also have another pandas DataFrame B with the columns category and words, where words is a comma-separated string, like:
category       words
audi           audi a4,audi a6
bugatti        bugatti veyron, bugatti chiron
mercedez       mercedez s-class, mercedez e-class
dslr           canon, nikon
apple          iphone 7s,iphone 6s,iphone 5
finance        sales,loans,sales price
politics       donald trump, election, votes
entertainment  spiderman,captain america, ironmen
music          justin beiber, rihana,drake
........ ..............
......... .........
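All I want is to map the keywords column of DataFrame A against the words column of DataFrame B and assign the corresponding category. Each keyword should be compared against every individual word in the strings of the words column. For example, the keyword a4 should be checked against both words of the string audi a4 in the words column. The expected result would be: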
keywords matched_category
['loans','mercedez','bugatti','a4'] ['finance','mercedez','mercedez','bugatti','bugatti','audi']
['trump','usa','election','president'] ['politics','politics']
['galaxy','7s','canon','macbook'] ['apple','dslr']
['beiber','spiderman','marvels','ironmen'] ['music','entertainment','entertainment','entertainment']
Answer 0 (score: 0)
One approach is to use pandas Series.transform:
import numpy as np
import pandas as pd

A = pd.DataFrame({'keywords': [['loans','mercedez','bugatti','a4'],
                               ['trump','usa','election','president']]})
B = pd.DataFrame({'category': ['audi', 'finance'],
                  'words': ['audi a4,audi a6', 'sales,loans,sales price']})

def match_category_to_keywords(kws):
    ret = []
    for kw in kws:
        # Boolean mask: rows of B where kw is a substring of any comma-separated phrase
        m = B['words'].transform(lambda words: any(kw in w for w in words.split(',')))
        ret.extend(B['category'].loc[m].tolist())
    # np.unique sorts and deduplicates (the pd.np alias is gone from recent pandas)
    return np.unique(ret)

A['matched_category'] = A['keywords'].transform(match_category_to_keywords)
print(A)
Output:
keywords matched_category
0 [loans, mercedez, bugatti, a4] [audi, finance]
1 [trump, usa, election, president] []
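Note that the snippet above matches substrings (kw in w) and removes duplicate categories, whereas the expected result in the question repeats a category once per matching phrase. If whole-word matching with repeats is what is wanted, a minimal sketch could look like the following. The helper name categories_for and the phrases list are hypothetical, and B is a shortened copy of the question's sample data:

```python
import pandas as pd

A = pd.DataFrame({'keywords': [['loans', 'mercedez', 'bugatti', 'a4'],
                               ['trump', 'usa', 'election', 'president']]})
B = pd.DataFrame({'category': ['audi', 'bugatti', 'mercedez', 'finance', 'politics'],
                  'words': ['audi a4,audi a6',
                            'bugatti veyron, bugatti chiron',
                            'mercedez s-class, mercedez e-class',
                            'sales,loans,sales price',
                            'donald trump, election, votes']})

# One (category, set of words) pair per comma-separated phrase in B.
phrases = [(cat, set(phrase.split()))
           for cat, words in zip(B['category'], B['words'])
           for phrase in words.split(',')]

def categories_for(kws):
    # Hypothetical helper: emit the category once for every phrase
    # that contains the keyword as a whole word.
    return [cat for kw in kws for cat, tokens in phrases if kw in tokens]

A['matched_category'] = A['keywords'].apply(categories_for)
print(A['matched_category'].tolist())
```

On this sample the first two rows come out as in the question's expected result, because mercedez and bugatti each appear in two phrases of their category.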
Answer 1 (score: 0)
Hopefully you can use:
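# Build a category -> list of words dictionary by splitting on commas and whitespace
# (df2 is the second DataFrame, called B in the question)
d = df2.set_index('category')['words'].str.split(r',\s*|\s+').to_dict()
# Flatten the lists into a word -> category dictionary
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print(d1)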
{'audi': 'audi', 'a4': 'audi', 'a6': 'audi', 'bugatti': 'bugatti',
'veyron': 'bugatti', 'chiron': 'bugatti', 'mercedez': 'mercedez',
's-class': 'mercedez', 'e-class': 'mercedez', 'canon': 'dslr',
'nikon': 'dslr', 'iphone': 'apple', '7s': 'apple', '6s': 'apple',
'5': 'apple', 'sales': 'finance', 'loans': 'finance', 'price': 'finance',
'donald': 'politics', 'trump': 'politics', 'election': 'politics',
'votes': 'politics', 'spiderman': 'entertainment', 'captain': 'entertainment',
'america': 'entertainment', 'ironmen': 'entertainment', 'justin': 'music',
'beiber': 'music', 'rihana': 'music', 'drake': 'music'}
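The answer stops at the lookup dictionary. One possible way to apply it to the keywords column is sketched below; this step is an assumption about the intended use, not something the answer shows. Here d1 is only a small excerpt of the dictionary printed above, and A is a shortened copy of the question's data:

```python
import pandas as pd

A = pd.DataFrame({'keywords': [['loans', 'mercedez', 'bugatti', 'a4'],
                               ['galaxy', '7s', 'canon', 'macbook']]})

# A small excerpt of the word -> category dictionary printed above.
d1 = {'a4': 'audi', 'bugatti': 'bugatti', 'mercedez': 'mercedez',
      'canon': 'dslr', '7s': 'apple', 'loans': 'finance'}

# Keep the category of every keyword found in d1; skip unknown keywords.
A['matched_category'] = A['keywords'].apply(
    lambda kws: [d1[kw] for kw in kws if kw in d1])
print(A['matched_category'].tolist())
```

This gives at most one category per keyword, so it does not reproduce the repeated entries in the question's expected result. Also, because d1 is a plain dictionary, a word listed under several categories keeps only the last one.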