My pandas DataFrame is:
word_list
['nuclear','election','usa','baseball']
['football','united','thriller']
['marvels','hollywood','spiderman']
....................
....................
....................
I also have several lists named after categories, for example:
movies=['spiderman','marvels','thriller']
sports=['baseball','hockey','football']
politics=['election','china','usa']
and many other categories.
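All I want is to match the keywords in the pandas column word_list against my category lists and write the names of the matching lists to a separate column. If any keyword does not match any list, miscellaneous should be added as well. So I am looking for the following output: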
word_list matched_list_names
['nuclear','election','usa','baseball'] politics,sports,miscellaneous
['football','united','thriller'] sports,movies,miscellaneous
['marvels','spiderman','hockey'] movies,sports
.................... .....................
.................... .....................
.................... ....................
I managed to get the matching keywords with:
for i in df['word_list']:      # i is one row's list of words
    for j in movies:           # j is one keyword from the movies list
        if j in i:
            print(j)
But this only gives me the matched keywords. How can I get the list names and add them to a pandas column?
Answer 0 (score: 3)
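You can first flatten the dictionary of lists into a keyword-to-category mapping, then look each word up with .get, using miscellaneous as the default for unmatched values. Convert the result to a set to keep only the unique categories, then turn it into a string with join: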
movies=['spiderman','marvels','thriller']
sports=['baseball','hockey','football']
politics=['election','china','usa']
d = {'movies':movies, 'sports':sports, 'politics':politics}
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
f = lambda x: ','.join(set([d1.get(y, 'miscellaneous') for y in x]))
df['matched_list_names'] = df['word_list'].apply(f)
print (df)
word_list matched_list_names
0 [nuclear, election, usa, baseball] politics,miscellaneous,sports
1 [football, united, thriller] miscellaneous,sports,movies
2 [marvels, hollywood, spiderman, budget] miscellaneous,movies
A similar solution with a list comprehension:
df['matched_list_names'] = [','.join(set([d1.get(y, 'miscellaneous') for y in x]))
for x in df['word_list']]
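Because a set does not preserve order, the category names in the joined string can come out in a different order between runs. Here is a minimal sketch, assuming you want the names in first-seen order: dict.fromkeys removes duplicates while keeping insertion order (Python 3.7+). The helper name label_words is hypothetical and not part of the answer above.

```python
movies = ['spiderman', 'marvels', 'thriller']
sports = ['baseball', 'hockey', 'football']
politics = ['election', 'china', 'usa']
d = {'movies': movies, 'sports': sports, 'politics': politics}

# Invert the mapping: each keyword points to its category name.
d1 = {word: name for name, words in d.items() for word in words}

def label_words(words, default='miscellaneous'):
    """Return the unique category names for `words`, in first-seen order."""
    names = (d1.get(w, default) for w in words)
    # dict.fromkeys drops duplicates but keeps insertion order.
    return ','.join(dict.fromkeys(names))

print(label_words(['nuclear', 'election', 'usa', 'baseball']))
# miscellaneous,politics,sports
```

With a DataFrame, this should slot in the same way as f: df['matched_list_names'] = df['word_list'].apply(label_words).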
Answer 1 (score: 1)
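First, I think you should take advantage of the O(1) lookups that sets and dictionaries give you. With that in mind, I would set up the data like this (note that the values are sets):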
d = dict(movies={'spiderman','marvels','thriller'},
sports={'baseball','hockey','football'},
politics={'election','china','usa'})
Then you can apply custom logic with transform:
def f(r):
    def m(r_):
        _ = [k for (k, v) in d.items() if r_ in v]
        return _ if _ else ['Misc']
    return {item for z in [m(r_) for r_ in r] for item in z}

df.word_list.transform(f)
0 {Misc, sports, politics}
1 {Misc, sports, movies}
2 {Misc, movies}
For 300,000 rows:
%timeit df.word_list.transform(f)
1.1 s ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That is not great, but it is workable.
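The function above scans every category for every word. One possible variation, offered as a sketch that has not been timed here, is to build an inverted index once so that each word costs a single dictionary lookup, while still allowing a word to belong to several categories. The names index and categories_for are hypothetical.

```python
from collections import defaultdict

d = dict(movies={'spiderman', 'marvels', 'thriller'},
         sports={'baseball', 'hockey', 'football'},
         politics={'election', 'china', 'usa'})

# Build word -> set of category names once, up front.
index = defaultdict(set)
for name, words in d.items():
    for word in words:
        index[word].add(name)

def categories_for(row, default='Misc'):
    """Return the set of category names matched by the words in `row`."""
    found = set()
    for word in row:
        # .get avoids inserting missing words into the defaultdict.
        found |= index.get(word, {default})
    return found

print(sorted(categories_for(['football', 'united', 'thriller'])))
# ['Misc', 'movies', 'sports']
```

It should be usable in the same place as f, for example df.word_list.transform(categories_for); whether it is actually faster on your data is something to measure.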