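How can I get the unique values of a column of lists in pandas or numpy, so that, for example, the second column
would yield "action", "crime", "drama"? The closest (but still non-working) solution I could come up with is: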
genres = data['Genre'].unique()
But this predictably raises a TypeError, because lists are not hashable:
TypeError: unhashable type: 'list'
Using a set seemed like a good idea, but
genres = data.apply(set(), columns=['Genre'], axis=1)
also fails, with
TypeError: set() takes no keyword arguments
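Both errors have fairly simple causes: `unique()` needs to hash each cell, which lists do not support, and `apply` most likely forwards the unknown `columns` keyword to `set`, which accepts no keyword arguments. A minimal sketch of how the set idea could be made to work, assuming a small stand-in frame (the `Movie` column and its values are hypothetical, since the original data is not shown):

```python
import pandas as pd

# Hypothetical stand-in for the question's data
data = pd.DataFrame({
    "Movie": ["The Godfather: Part II", "The Dark Knight"],
    "Genre": [["crime", "drama"], ["action", "crime", "drama"]],
})

# Apply set to the single column (a Series), passing no extra keywords
genre_sets = data["Genre"].apply(set)

# Merge the per-row sets into one set of unique genres
genres = set().union(*genre_sets)
print(sorted(genres))
```

Selecting the column first avoids the `columns` keyword entirely, and the per-row sets are merged afterwards instead of asking pandas to hash the lists.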
Answer 0 (score: 3)
If you only want to find the unique values, I'd recommend using itertools.chain.from_iterable
to concatenate all of those lists
import itertools
import numpy as np
>>> np.unique([*itertools.chain.from_iterable(df.Genre)])
array(['action', 'crime', 'drama'], dtype='<U6')
Or, even faster
>>> set(itertools.chain.from_iterable(df.Genre))
{'action', 'crime', 'drama'}
Timings
df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
df = pd.concat([df]*10000)
%timeit set(itertools.chain.from_iterable(df.Genre))
100 loops, best of 3: 2.55 ms per loop
%timeit set([x for y in df['Genre'] for x in y])
100 loops, best of 3: 4.09 ms per loop
%timeit np.unique([*itertools.chain.from_iterable(df.Genre)])
100 loops, best of 3: 12.8 ms per loop
%timeit np.unique(df['Genre'].sum())
1 loop, best of 3: 1.65 s per loop
%timeit set(df['Genre'].sum())
1 loop, best of 3: 1.66 s per loop
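The `%timeit` lines above only work inside IPython. As a rough sketch, a similar comparison can be run as a plain script with the standard `timeit` module. The labels in `candidates` and the smaller repeat count are assumptions made for illustration, and the absolute numbers will differ from machine to machine:

```python
import itertools
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})
# Fewer rows than in the answer, so the slow variant finishes quickly
df = pd.concat([df] * 1000, ignore_index=True)

# Hypothetical labels mapped to the approaches timed above
candidates = {
    "set + chain": lambda: set(itertools.chain.from_iterable(df["Genre"])),
    "set + comprehension": lambda: set(x for y in df["Genre"] for x in y),
    "np.unique + chain": lambda: np.unique(list(itertools.chain.from_iterable(df["Genre"]))),
    "set + Series.sum": lambda: set(df["Genre"].sum()),
}

expected = {"action", "crime", "drama"}
for name, func in candidates.items():
    # Every approach should agree on the result before it is timed
    assert set(func()) == expected
    seconds = timeit.timeit(func, number=5) / 5
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

The `sum()` variants are so much slower because each addition builds a new, longer list, so the total work grows roughly quadratically with the number of rows.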
Answer 1 (score: 2)
You can use explode:
data = pd.DataFrame([
{
"title": "The Godfather: Part II",
"genres": ["crime", "drama"],
"director": "Francis Ford Coppola"
},
{
"title": "The Dark Knight",
"genres": ["action", "crime", "drama"],
"director": "Christopher Nolan"
}
])
# Changed from data.explode("genres")["genres"].unique() as suggested by rafaelc
data["genres"].explode().unique()
Result:
array(['crime', 'drama', 'action'], dtype=object)
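One thing to watch for with `explode` (available from pandas 0.25 onward) is that rows holding an empty list or a missing value come out as NaN/None entries. A small sketch, using made-up titles and an extra empty and missing row that are not part of the original answer:

```python
import pandas as pd

# Hypothetical data: the last two rows have no genres at all
data = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genres": [["crime", "drama"], ["action", "crime", "drama"], [], None],
})

# One row per list element; the empty list and None each yield a missing value
exploded = data["genres"].explode()

# Drop the missing entries before collecting the unique genres
unique_genres = exploded.dropna().unique()
print(sorted(unique_genres))
```

Without the `dropna()` call, the missing entries would show up among the unique values.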
Answer 2 (score: 1)
Here are a few options:
# toy data
df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
np.unique(df['Genre'].sum())
# 109 µs ± 2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
set(df['Genre'].sum())
# 87 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
set([x for y in df['Genre'] for x in y])
# 11.8 µs ± 126 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Answer 3 (score: 0)
If you just want to extract the information rather than add it back to the DataFrame, you can update a Python set inside a for loop:
import pandas as pd
df = pd.DataFrame({'movie':[[1,2,3],[1,2,6]]})
out = set()
for row in df['movie']:
    out.update({item for item in row})
print(out)
If you prefer, you can also wrap this in an apply call (it returns None for each row, but updates the set as a side effect):
out = set()
df['movie'].apply(lambda x: out.update({item for item in x}))
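Personally, I find the for loop easier to read.
Answer 4 (score: 0)
Not sure whether this is exactly what you want, but it will let you convert the column to a set.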
import pandas as pd
import numpy as np
df = pd.DataFrame({'Movie':['The Godfather', 'Dark Knight'], 'Genre': [['Crime', 'Drama'],['Crime', 'Drama', 'Action']]})
genres = []
for sublist in df['Genre']:
    for item in sublist:
        genres.append(item)
genre_set = set(genres)
print(genre_set)
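Output (set order may vary): {'Action', 'Drama', 'Crime'}
Answer 5 (score: 0)
Use the properties of sets to get uniqueness while combining the lists. I have used this technique on huge lists in big-data settings such as envs. The main advantage is that it cuts the time needed to build the final flat list.
Try: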
from functools import reduce # for python 3
l = df.Genre.dropna().tolist()
sets = [ set(i) for i in l ]
final_set = reduce(lambda x, y: x.union(y), sets)
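As written, `reduce` raises a TypeError if the column contains no lists at all, because there is no initial value to fall back on. A hedged variant of the same idea, assuming a small frame with one missing row (hypothetical data), passes an empty set as the starting value:

```python
from functools import reduce

import pandas as pd

# Hypothetical data with one missing entry
df = pd.DataFrame({"Genre": [["crime", "drama"], None, ["action", "crime", "drama"]]})

lists = df["Genre"].dropna().tolist()
sets = [set(i) for i in lists]

# The initial value set() keeps reduce from failing when `sets` is empty
final_set = reduce(set.union, sets, set())

# The same result without reduce: union accepts any number of iterables
also_final = set().union(*lists)
print(sorted(final_set))
```

Both forms give the same set; the second skips building the intermediate per-row sets.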