Constructing an index/membership vector from lists of items

Time: 2017-06-07 18:06:49

Tags: python pandas numpy

Say you have the following baskets:

basket1 = ['apple', 'orange', 'banana']
basket2 = ['orange', 'grape']
basket3 = ['banana', 'grape', 'kiwi', 'orange']

baskets = [basket1, basket2, basket3]

Your goal is to create the following data structure:

pd.DataFrame({'apple': {'basket1': 1,'basket2': 0,'basket3': 0 }, 'orange': {'basket1': 1,'basket2': 1,'basket3': 1 }, 'banana': {'basket1': 1,'basket2': 0,'basket3': 1 }, 'grape': {'basket1': 0,'basket2': 1,'basket3': 1 }, 'kiwi': {'basket1': 0,'basket2': 0,'basket3': 1 } })

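I know about Counter from collections and bincount from numpy, which you can use if all you want is a binary table like the one above. But suppose you want to put some other value at each position.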

For example, suppose that instead of a 1 at each position, you want the weight of the fruit, which you happen to have in another table:

pd.DataFrame({'weight': {'apple': 3, 'orange':3, 'banana':2, 'grape':1, 'kiwi':2}})

The result you want is:

pd.DataFrame({'apple': { 'basket1': 3, 'basket2': 0, 'basket3': 0 }, 'orange': { 'basket1': 3, 'basket2': 3, 'basket3': 3 }, 'banana': { 'basket1': 2, 'basket2': 0, 'basket3': 2 }, 'grape': { 'basket1': 0, 'basket2': 1, 'basket3': 1 }, 'kiwi': { 'basket1': 0, 'basket2': 0, 'basket3': 2 } })

How would you write such an operation cleanly? I'm not sure how to do this efficiently or well.

1 answer:

Answer 0: (score: 2)

Suppose you start with a pd.DataFrame and a dict:

In [37]: df1
Out[37]:
         apple  banana  grape  kiwi  orange
basket1      1       1      0     0       1
basket2      0       0      1     0       1
basket3      0       1      1     1       1

In [38]: mapper = {'apple': 3, 'orange':3, 'banana':2, 'grape':1, 'kiwi':2}
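The answer takes `df1` as given. For readers who want to reproduce it, here is one possible way to build that membership table from the question's baskets. This is only a sketch and not part of the original answer; the `named_baskets` dict is an assumed helper introduced for illustration.

```python
import pandas as pd

basket1 = ['apple', 'orange', 'banana']
basket2 = ['orange', 'grape']
basket3 = ['banana', 'grape', 'kiwi', 'orange']

# Assumed helper (not in the original): label each basket so that the
# labels become the row index of the result.
named_baskets = {'basket1': basket1, 'basket2': basket2, 'basket3': basket3}

# Build one Series of 1s per basket, indexed by fruit name. Combining them
# leaves NaN wherever a fruit is absent from a basket.
df1 = (
    pd.DataFrame({name: pd.Series(1, index=items)
                  for name, items in named_baskets.items()})
    .T                    # baskets as rows, fruits as columns
    .fillna(0)            # absent fruit -> 0
    .astype(int)          # NaN forced floats; go back to integers
    .sort_index(axis=1)   # alphabetical columns, as in the display above
)
print(df1)
```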

Then simply:

In [39]: for colname in df1:
    ...:     df1[colname] = df1[colname]*mapper[colname]
    ...:

In [40]: df1
Out[40]:
         apple  banana  grape  kiwi  orange
basket1      3       2      0     0       3
basket2      0       0      1     0       3
basket3      0       2      1     2       3

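Or, more simply, you can multiply a pd.DataFrame by a pd.Series (i.e. a "column" of a data frame), and pandas aligns the Series index with the DataFrame's columns: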

In [5]: df2 = pd.DataFrame({'weight': {'apple': 3, 'orange':3, 'banana':2, 'grape':1, 'kiwi':2}})

In [6]: mapper = df2.squeeze() # convert to series

In [7]: df1*mapper
Out[7]:
         apple  banana  grape  kiwi  orange
basket1      3       2      0     0       3
basket2      0       0      1     0       3
basket3      0       2      1     2       3
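The multiplication works because pandas matches the Series index against the DataFrame's column labels. The following self-contained sketch makes that alignment explicit with `DataFrame.mul`; the names `weights` and `weighted` are chosen here for illustration and do not appear in the original answer.

```python
import pandas as pd

# The membership table from the answer, rebuilt by hand.
df1 = pd.DataFrame(
    {'apple': [1, 0, 0], 'banana': [1, 0, 1], 'grape': [0, 1, 1],
     'kiwi': [0, 0, 1], 'orange': [1, 1, 1]},
    index=['basket1', 'basket2', 'basket3'],
)

# Weights keyed by fruit name, deliberately in a different order
# from df1's columns.
weights = pd.Series({'apple': 3, 'orange': 3, 'banana': 2, 'grape': 1, 'kiwi': 2})

# axis='columns' matches the Series index to df1's column labels,
# so the order of the entries in `weights` does not matter.
weighted = df1.mul(weights, axis='columns')
print(weighted)
```

If a fruit were missing from `weights`, its column would come out as NaN rather than raising an error, so it may be worth checking for that before relying on the result.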

Or, from scratch:

In [8]: basket1 = ['apple', 'orange', 'banana']
   ...: basket2 = ['orange', 'grape']
   ...: basket3 = ['banana', 'grape', 'kiwi', 'orange']
   ...:
   ...: baskets = [basket1, basket2, basket3]
   ...:

In [9]: fruitvolume = {'apple': 3, 'orange':3, 'banana':2, 'grape':1, 'kiwi':2}

Then simply:

In [12]: data = [{item:fruitvolume[item] for item in basket} for basket in baskets]

In [13]: data
Out[13]:
[{'apple': 3, 'banana': 2, 'orange': 3},
 {'grape': 1, 'orange': 3},
 {'banana': 2, 'grape': 1, 'kiwi': 2, 'orange': 3}]

In [14]: df = pd.DataFrame(data)

In [15]: df
Out[15]:
   apple  banana  grape  kiwi  orange
0    3.0     2.0    NaN   NaN       3
1    NaN     NaN    1.0   NaN       3
2    NaN     2.0    1.0   2.0       3

But now you have to do some cleanup to replace the NaN values and get integers back:

In [16]: df = df.fillna(0).astype(int)

In [17]: df
Out[17]:
   apple  banana  grape  kiwi  orange
0      3       2      0     0       3
1      0       0      1     0       3
2      0       2      1     2       3
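The from-scratch steps can also be chained into a single expression that keeps the basket names as the row index instead of 0, 1, 2. This is a sketch rather than the original author's code; it assumes the labels are supplied by hand through a `names` list, which is introduced here for illustration.

```python
import pandas as pd

baskets = [
    ['apple', 'orange', 'banana'],
    ['orange', 'grape'],
    ['banana', 'grape', 'kiwi', 'orange'],
]
names = ['basket1', 'basket2', 'basket3']  # assumed labels, one per basket
fruitvolume = {'apple': 3, 'orange': 3, 'banana': 2, 'grape': 1, 'kiwi': 2}

result = (
    pd.DataFrame(
        # One dict per basket: fruit -> weight, only for fruits present.
        [{item: fruitvolume[item] for item in basket} for basket in baskets],
        index=names,
    )
    .fillna(0)     # fruits missing from a basket become 0
    .astype(int)   # NaN forced float columns; cast back to integers
)
print(result)
```

The column order depends on the pandas version, so select by label (for example `result.loc['basket3', 'kiwi']`) rather than by position.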