I have a pandas DataFrame with a column, df['category'], that looks like this:
0 ['ACCESSORIES', 'AUDIO', 'LOUNGE']
1 ['ACCESSORIES', 'MAJOR APPLIANCES', 'VISUAL']
2 ['BEDROOM SUITES', 'COMPUTERS', 'COMPUTERS', 'HOME OFFICE', 'HOME OFFICE', 'MAJOR APPLIANCES', 'VISUAL']
3 ['BEDDING', 'MAJOR APPLIANCES', 'MAJOR APPLIANCES', 'SMALL APPLIANCES', 'SMALL APPLIANCES']
4 ['PATIO']
5 ['MAJOR APPLIANCES', 'SMALL APPLIANCES']
6 ['ACCESSORIES', 'MAJOR APPLIANCES', 'MAJOR APPLIANCES', 'SMALL APPLIANCES', 'SMALL APPLIANCES', 'VISUAL', 'VISUAL']
I need to go through the whole column (37,000 rows) and add every item to a set, because I don't want duplicate values. I tried:
categories = set()
categories = df['category'].apply(lambda a: set(a))
This returns a pandas Series that looks like this:
0 {'AUDIO', 'LOUNGE', 'ACCESSORIES'}
1 {'MAJOR APPLIANCES', 'ACCESSORIES', 'VISUAL'}
2 {'BEDROOM SUITES', 'COMPUTERS', 'HOME OFFICE', 'MAJOR APPLIANCES', 'VISUAL'}
3 {'BEDDING', 'MAJOR APPLIANCES', 'SMALL APPLIANCES'}
4 {'PATIO'}
5 {'MAJOR APPLIANCES', 'SMALL APPLIANCES'}
6 {'ACCESSORIES', 'MAJOR APPLIANCES', 'SMALL APPLIANCES', 'VISUAL'}
As stated above, what I actually need is a single list containing only the unique values, like this:
[AUDIO, ACCESSORIES, BEDROOM, COMPUTERS, LOUNGE, MAJOR APPLIANCES, ... , VISUAL]
Answer 0 (score: 5)
How about this?
set(df['category'].sum())
Or this:
result = set()
df['category'].apply(result.update)
# result now holds every unique category; the Series of None values that apply() returns can be ignored
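A note on the first option: `.sum()` on a column of lists works by concatenating the lists one after another, which may get slow on a column with tens of thousands of rows. As a rough sketch of an alternative (not something from the original answer), `itertools.chain.from_iterable` flattens the rows lazily before the set is built. The small DataFrame below is made-up sample data that only mirrors the shape of the question's column.

```python
from itertools import chain

import pandas as pd

# Hypothetical sample data mirroring the question's 'category' column.
df = pd.DataFrame({
    'category': [
        ['ACCESSORIES', 'AUDIO', 'LOUNGE'],
        ['ACCESSORIES', 'MAJOR APPLIANCES', 'VISUAL'],
        ['PATIO'],
    ]
})

# Flatten all row lists lazily, then deduplicate with a set.
unique_categories = set(chain.from_iterable(df['category']))

# A set has no defined order, so sort it to get a stable list.
unique_list = sorted(unique_categories)
print(unique_list)
```

If you only need membership checks, `unique_categories` is enough; `sorted(...)` is there only because the question asks for a list.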
Answer 1 (score: 1)
You can try the following (Series.explode requires pandas 0.25 or later):
import pandas as pd
categories = [['ACCESSORIES', 'AUDIO', 'LOUNGE'], ['ACCESSORIES', 'MAJOR APPLIANCES', 'VISUAL'], ['BEDROOM SUITES', 'COMPUTERS', 'COMPUTERS', 'HOME OFFICE', 'HOME OFFICE', 'MAJOR APPLIANCES', 'VISUAL'], ['BEDDING', 'MAJOR APPLIANCES', 'MAJOR APPLIANCES', 'SMALL APPLIANCES', 'SMALL APPLIANCES'], ['PATIO'], ['MAJOR APPLIANCES', 'SMALL APPLIANCES'], ['ACCESSORIES', 'MAJOR APPLIANCES', 'MAJOR APPLIANCES', 'SMALL APPLIANCES', 'SMALL APPLIANCES', 'VISUAL', 'VISUAL']]
df = pd.DataFrame({'category': categories})
print('pandas', pd.__version__)
sorted(set(df.category.explode()))
Result:
pandas 0.25.3
['ACCESSORIES',
'AUDIO',
'BEDDING',
'BEDROOM SUITES',
'COMPUTERS',
'HOME OFFICE',
'LOUNGE',
'MAJOR APPLIANCES',
'PATIO',
'SMALL APPLIANCES',
'VISUAL']
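If you would rather keep the categories in order of first appearance instead of sorting them, a possible variation is to call `unique()` on the exploded Series. This is only a sketch on made-up sample data; it also assumes some rows might hold an empty list, which `explode()` turns into NaN, hence the `dropna()`.

```python
import pandas as pd

# Hypothetical sample data; the empty list stands in for a row with no categories.
df = pd.DataFrame({
    'category': [
        ['ACCESSORIES', 'AUDIO', 'LOUNGE'],
        [],  # becomes NaN after explode()
        ['AUDIO', 'VISUAL', 'VISUAL'],
    ]
})

# One row per list element; drop the NaN produced by the empty list.
exploded = df['category'].explode().dropna()

# unique() keeps values in order of first appearance.
unique_in_order = exploded.unique().tolist()
print(unique_in_order)
```

Wrapping the result in `sorted(...)` gives the same alphabetical list as in the answer above.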