Question

I am trying to reshape a DataFrame into a kind of occurrence matrix, but so far without success.

Is pandas.get_dummies() the right tool for this at all?

Here is what I have tried so far:

import pandas as pd 

xlst_entries = [[u'aus', u'fra', u'gbr'],[u'gbr', u'prt'],[u'chn'],[u'bel', u'gbr'],[u'gbr', u'prt'],[u'gbr', u'prt'],[u'gbr', u'prt']]

qq1 = pd.DataFrame(xlst_entries)

qq2 = pd.get_dummies(data=qq1, prefix=None)
qq2

But the result I want is:

index  fra  bel  chn  prt  aus  gbr
    0    1    0    0    0    1    1
    1    0    0    0    1    0    1
    2    0    0    1    0    0    0
    3    0    1    0    0    0    1
    4    0    0    0    1    0    1
    5    0    0    0    1    0    1
    6    0    0    0    1    0    1

Answer 1

You can preprocess xlst_entries so that each row becomes a single string with its entries separated by |, then use Series.str.get_dummies, which splits on | by default:

xlst_entries = ['|'.join(x) for x in xlst_entries]
qq1 = pd.Series(xlst_entries).str.get_dummies()

The resulting output:

   aus  bel  chn  fra  gbr  prt
0    1    0    0    1    1    0
1    0    0    0    0    1    1
2    0    0    1    0    0    0
3    0    1    0    0    1    0
4    0    0    0    0    1    1
5    0    0    0    0    1    1
6    0    0    0    0    1    1
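
For anyone who wants to run this from scratch, here is a minimal self-contained sketch of the same steps. The separator is written out explicitly (sep='|' is already the default of Series.str.get_dummies), and the last step reorders the columns to match the layout asked for in the question. The names joined, dummies and wanted are illustrative only and do not come from the original post.

```python
import pandas as pd

xlst_entries = [['aus', 'fra', 'gbr'], ['gbr', 'prt'], ['chn'],
                ['bel', 'gbr'], ['gbr', 'prt'], ['gbr', 'prt'], ['gbr', 'prt']]

# Join each row's codes into one '|'-separated string, e.g. 'aus|fra|gbr'.
joined = pd.Series(['|'.join(row) for row in xlst_entries])

# One 0/1 indicator column per distinct code; columns are sorted alphabetically.
dummies = joined.str.get_dummies(sep='|')

# Optional: reorder the columns to the order shown in the question.
wanted = dummies[['fra', 'bel', 'chn', 'prt', 'aus', 'gbr']]
print(wanted)
```

Note that this only works as long as none of the entries themselves contain the separator character.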

Answer 2

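You could pass prefix='' and prefix_sep='' to get_dummies so that the generated columns carry no prefix, then sum the columns that share a name to obtain the desired frame. Here qq1 is the DataFrame from the question; the frame is transposed for the grouping step because groupby(..., axis=1) is deprecated in recent pandas versions.

df = pd.get_dummies(qq1, prefix='', prefix_sep='')

df.T.groupby(level=0).sum().T.astype(int)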

   aus  bel  chn  fra  gbr  prt
0    1    0    0    1    1    0
1    0    0    0    0    1    1
2    0    0    1    0    0    0
3    0    1    0    0    1    0
4    0    0    0    0    1    1
5    0    0    0    0    1    1
6    0    0    0    0    1    1
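
To see why the summing step is needed, here is a small self-contained sketch, assuming the input is the qq1 frame built in the question. The names flat and result are illustrative only. It groups the transposed frame rather than passing axis=1 to groupby, since axis=1 is deprecated in recent pandas releases; that substitution is my own assumption about how to express the same idea, not part of the original answer.

```python
import pandas as pd

xlst_entries = [['aus', 'fra', 'gbr'], ['gbr', 'prt'], ['chn'],
                ['bel', 'gbr'], ['gbr', 'prt'], ['gbr', 'prt'], ['gbr', 'prt']]

# Positional columns 0, 1, 2; shorter rows are padded with None.
qq1 = pd.DataFrame(xlst_entries)

# Each positional column is encoded separately. With an empty prefix the
# resulting names collide, so a code such as 'gbr' shows up several times.
flat = pd.get_dummies(qq1, prefix='', prefix_sep='')
print(flat.columns.tolist())

# Collapse columns with the same name by summing them (grouping on the
# transposed frame's index), then transpose back and cast to int.
result = flat.T.groupby(level=0).sum().T.astype(int)
print(result)
```

Because a code appears at most once per row in this data, the sums stay at 0 or 1. If a row could repeat a code, the result would hold counts rather than indicators.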

Answer 3

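Here is a slightly more general helper function that works for almost any data.frame (written in Python 2; to test it with Python 3, make sure to wrap the map and reduce functions in list):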

pandas: co-occurrence matrix with get_dummies
