I am trying to reshape a dataframe to create a kind of occurrence matrix but without success.
Is pandas.get_dummies() the right way to do this at all?
Here is what I have tried so far:
import pandas as pd
xlst_entries = [[u'aus', u'fra', u'gbr'],[u'gbr', u'prt'],[u'chn'],[u'bel', u'gbr'],[u'gbr', u'prt'],[u'gbr', u'prt'],[u'gbr', u'prt']]
qq1 = pd.DataFrame(xlst_entries)
qq2 = pd.get_dummies(data= qq1, prefix=None)
qq2
But the result I want is:
index fra bel chn prt aus gbr
0 1 0 0 0 1 1
1 0 0 0 1 0 1
2 0 0 1 0 0 0
3 0 1 0 0 0 1
4 0 0 0 1 0 1
5 0 0 0 1 0 1
6 0 0 0 1 0 1
Answer 0 (score: 1)
You can preprocess xlst_entries so that each row's entries are joined into a single string separated by |, then use Series.str.get_dummies, which splits on | by default:
xlst_entries = ['|'.join(x) for x in xlst_entries]
qq1 = pd.Series(xlst_entries).str.get_dummies()
The resulting output:
aus bel chn fra gbr prt
0 1 0 0 1 1 0
1 0 0 0 0 1 1
2 0 0 1 0 0 0
3 0 1 0 0 1 0
4 0 0 0 0 1 1
5 0 0 0 0 1 1
6 0 0 0 0 1 1
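For reference, the steps above can be put together as a self-contained sketch. The names joined and occ, and the final column reordering to match the layout asked for in the question, are assumptions added for illustration and are not part of the answer.

```python
import pandas as pd

xlst_entries = [['aus', 'fra', 'gbr'], ['gbr', 'prt'], ['chn'],
                ['bel', 'gbr'], ['gbr', 'prt'], ['gbr', 'prt'], ['gbr', 'prt']]

# Join each row into one '|'-separated string; '|' is also the default
# separator used by Series.str.get_dummies.
joined = pd.Series(['|'.join(row) for row in xlst_entries])
occ = joined.str.get_dummies(sep='|')

# Optional (illustrative): reorder the columns to match the question's layout.
occ = occ[['fra', 'bel', 'chn', 'prt', 'aus', 'gbr']]
print(occ)
```

Note that this relies on no entry containing the | character itself; if that could happen, pick another separator and pass it as sep.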
Answer 1 (score: 1)
You could set the prefix and prefix_sep parameters of get_dummies to empty strings so that the generated columns carry no prefix, then sum the columns that share a name to obtain the desired frame.
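df = pd.DataFrame(xlst_entries)
df = pd.get_dummies(df, prefix='', prefix_sep='')
# Group on the transposed frame; groupby(..., axis=1) is deprecated in recent pandas
df.T.groupby(level=0).sum().T.astype(int)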
aus bel chn fra gbr prt
0 1 0 0 1 1 0
1 0 0 0 0 1 1
2 0 0 1 0 0 0
3 0 1 0 0 1 0
4 0 0 0 0 1 1
5 0 0 0 0 1 1
6 0 0 0 0 1 1
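A variant of the same idea, not taken from the answer itself, is to stack the frame into one long Series first, so that get_dummies never produces duplicate column names. The names stacked and occ2 are assumptions for illustration; treat this as a sketch rather than the answer's method.

```python
import pandas as pd

xlst_entries = [['aus', 'fra', 'gbr'], ['gbr', 'prt'], ['chn'],
                ['bel', 'gbr'], ['gbr', 'prt'], ['gbr', 'prt'], ['gbr', 'prt']]
df = pd.DataFrame(xlst_entries)

# One value per (row, position) pair. dropna() removes the padding cells left
# by the shorter rows (some pandas versions already drop them inside stack()).
stacked = df.stack().dropna()

# One-hot encode the values, then collapse the positions back to one row per
# original index. max() marks presence as 0/1; use sum() to get counts instead.
occ2 = pd.get_dummies(stacked).groupby(level=0).max().astype(int)
print(occ2)
```

Because the grouping happens on the row index rather than on column labels, this form does not need groupby along axis=1.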
Answer 2 (score: 1)
Here is a slightly more general helper function that works for almost any data.frame (written for Python 2; to test it under Python 3, make sure to wrap the map and reduce calls in list):