Mapping list elements to the keys of a dictionary, with decimal values, in Python

Time: 2017-09-01 06:53:25

Tags: python pandas

I have a list of words as follows.

mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat','lorry']

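I also have a list of lists, one inner list per essay in a dataset, holding a value for each word of 'mylist', as in the example below (i.e. if a word of 'mylist' occurs in the essay, it gets a value between 0 and 1).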

[[0,0.7,0,0,0,0.3,0,0.6], [0.2,0,0,0,0,0,0.8,0]]

In other words,

[0,0.7,0,0,0,0.3,0,0.6] says that this only has values 'yellow', 'jeep', 'lorry'
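
That reading can be checked by pairing each word with its score. A minimal sketch; the name `present` is only illustrative:

```python
mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat', 'lorry']
row = [0, 0.7, 0, 0, 0, 0.3, 0, 0.6]

# Pair each word with its score and keep the words whose score is non-zero.
present = [word for word, score in zip(mylist, row) if score > 0]
print(present)  # ['yellow', 'jeep', 'lorry']
```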

Now I have a dictionary of categories as follows.

mydictionary = {'colour': ['red', 'yellow', 'green'], 'animal': ['rat','cat'], 
'vehicle': ['car', 'jeep']}

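Now, using the keys of 'mydictionary', I want to transform the list of lists as follows (i.e. if one or more of a category's words in 'mylist' has a non-zero value, I mark that key with the average of those values).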

[[0.7, 0, 0.45], [0, 0.5, 0]]

In other words,

[0.7, 0, 0.45] says that;
0.7 - average value for elements in 'colours' = 0.7/1 = 0.7
0 - no elements in 'animals'
0.45 - average value for elements in 'vehicles' = (0.3+0.6)/2 = 0.45
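
The arithmetic above can be written out as a short sketch. It assumes that 'lorry' counts as a vehicle, which the expected 0.45 requires even though the dictionary shown earlier does not list it; the helper name `category_scores` is hypothetical.

```python
mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat', 'lorry']

# Assumption: 'lorry' is added to 'vehicle' so that (0.3 + 0.6) / 2 applies.
categories = {
    'colour': ['red', 'yellow', 'green'],
    'animal': ['rat', 'cat'],
    'vehicle': ['car', 'jeep', 'lorry'],
}

def category_scores(row):
    """Average the non-zero scores of each category; 0 if none are present."""
    score_of = dict(zip(mylist, row))
    result = []
    for words in categories.values():
        present = [score_of[w] for w in words if score_of[w] > 0]
        result.append(sum(present) / len(present) if present else 0)
    return result

scores = category_scores([0, 0.7, 0, 0, 0, 0.3, 0, 0.6])
# Rounded for display, since 0.3 + 0.6 is not exactly 0.9 in floating point.
print([round(s, 2) for s in scores])  # [0.7, 0, 0.45]
```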

So my output should be a list of lists as described above -> [[0.7, 0, 0.45], [0, 0.5, 0]]

I would like to know whether this can be done with a pandas DataFrame.

3 Answers:

Answer 0: (score: 3)

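You should reconsider your data structure. One problem you will run into is that a dict is inherently unordered (insertion order is only guaranteed from Python 3.7). First, deal with the ordering by putting the values in an ordered container (a list works fine):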

>>> vals = [mydictionary['colour'], mydictionary['animal'], mydictionary['vehicle']]
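
On Python 3.7 and later, where dicts keep their insertion order, the same ordered list can also be derived from the dictionary itself. A minimal sketch under that assumption:

```python
mydictionary = {'colour': ['red', 'yellow', 'green'], 'animal': ['rat', 'cat'],
                'vehicle': ['car', 'jeep']}

# Assumes Python 3.7+, where iteration follows insertion order.
keys = list(mydictionary)
vals = list(mydictionary.values())
print(keys)  # ['colour', 'animal', 'vehicle']
```

On older interpreters the explicit list is the safer choice.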

Now the essays:

>>> essays = [[0,0.7,0,0,0,0.3,0,0.6], [0.2,0,0,0,0,0,0.8,0]]

Then a simple loop builds a map from mylist to each essay's weights and uses the mean function from the statistics package:

>>> import statistics as stats
>>> result = []
>>> for essay in essays:
...     map = dict(zip(mylist, essay))
...     result.append([stats.mean(map[e] for e in v) for v in vals])
...
>>> result
[[0.2333333333333333, 0, 0.15], [0, 0.5, 0]]

Honestly, I'm not sure pandas is the best tool for this, but I suppose you could use a DataFrame like this:

>>> import pandas as pd
>>> df = pd.DataFrame({'essay{}'.format(i):essay for i, essay in enumerate(essays)}, index=mylist)
>>> df
        essay0  essay1
cat        0.0     0.2
yellow     0.7     0.0
car        0.0     0.0
red        0.0     0.0
green      0.0     0.0
jeep       0.3     0.0
rat        0.0     0.8
lorry      0.6     0.0

Then, build a grouper mapping:

>>> grouper  = {v: k for k, vv in mydictionary.items() for v in vv}
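
Because the grouper is built from mydictionary, a word that no category lists has no group to fall into. A quick sketch to spot such words (the name `unmapped` is illustrative); with the question's dictionary it flags 'lorry', which is why the vehicle mean in this answer reflects only 'car' and 'jeep':

```python
mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat', 'lorry']
mydictionary = {'colour': ['red', 'yellow', 'green'], 'animal': ['rat', 'cat'],
                'vehicle': ['car', 'jeep']}

grouper = {v: k for k, vv in mydictionary.items() for v in vv}

# Words that appear in mylist but belong to no category.
unmapped = [word for word in mylist if word not in grouper]
print(unmapped)  # ['lorry']
```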

Then use pd.DataFrame.groupby:

>>> df.groupby(grouper).mean()
           essay0  essay1
animal   0.000000     0.5
colour   0.233333     0.0
vehicle  0.150000     0.0

Edit

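Following the comments, the fix is very simple: materialize the weights into a list, filtering out the zeros, like this: [map[e] for e in v if map[e]], and then take the mean of that list. However, you have to make sure the list is not empty. Just define a helper function that checks for this and returns a default value of 0: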

>>> def mean_default(seq):
...     if seq:
...         return stats.mean(seq)
...     else:
...         return 0
...

Then simply:

>>> result = []
>>> for essay in essays:
...     map = dict(zip(mylist, essay))
...     result.append([mean_default([map[e] for e in v if map[e]]) for v in vals])
...

For pandas, as @IanS showed, simply replace 0 with np.nan.
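
For reference, a rough sketch of how the pandas variant with NaN in place of 0 might look end to end. It assumes 'lorry' is added to 'vehicle' (as the question's expected output implies), and the final reindex step is my own addition to restore the dictionary's category order:

```python
import numpy as np
import pandas as pd

mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat', 'lorry']
# Assumption: 'lorry' is listed under 'vehicle'.
mydictionary = {'colour': ['red', 'yellow', 'green'], 'animal': ['rat', 'cat'],
                'vehicle': ['car', 'jeep', 'lorry']}
essays = [[0, 0.7, 0, 0, 0, 0.3, 0, 0.6], [0.2, 0, 0, 0, 0, 0, 0.8, 0]]

df = pd.DataFrame({'essay{}'.format(i): essay for i, essay in enumerate(essays)},
                  index=mylist, dtype=float)
grouper = {v: k for k, vv in mydictionary.items() for v in vv}

# Zeros become NaN, which mean() skips; groups with no values are reset to 0.
means = df.replace(0, np.nan).groupby(grouper).mean().fillna(0)

# Restore the category order of the dictionary and emit one list per essay.
result = means.reindex(list(mydictionary)).T.values.tolist()
print(result)
```

The result should match the requested [[0.7, 0, 0.45], [0, 0.5, 0]] up to floating-point rounding.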

Answer 1: (score: 1)

First, invert the keys and values of the dictionary:

{v: k for k, l in mydictionary.items() for v in l}

This returns:

{'car': 'vehicle',
 'cat': 'animal',
 'green': 'colour',
 'jeep': 'vehicle',
 'rat': 'animal',
 'red': 'colour',
 'yellow': 'colour'}

Second, map it to get the category of each word:

df = pd.DataFrame(mylist, columns=['word'])
df['category'] = df['word'].map({v: k for k, l in mydictionary.items() for v in l})

Output:

# note: I have added lorry to the dictionary
     word category
0     cat   animal
1  yellow   colour
2     car  vehicle
3     red   colour
4   green   colour
5    jeep  vehicle
6     rat   animal
7   lorry  vehicle

Third, map it onto your list of lists by concatenating:

df = pd.concat([
    df,
    pd.DataFrame([[0,0.7,0,0,0,0.3,0,0.6], [0.2,0,0,0,0,0,0.8,0]]).T
], axis=1)

Fourth, group by category:

df.groupby('category').mean()

Output:

                 0    1
category               
animal    0.000000  0.5
colour    0.233333  0.0
vehicle   0.300000  0.0

Edit: to ignore the 0 values, replace them with NaN.

df.replace({0: np.nan}).groupby('category').mean()

Output:

             0    1
category           
animal     NaN  0.5
colour    0.70  NaN
vehicle   0.45  NaN

If needed, you can then call fillna(0).

Answer 2: (score: 1)

Setup

mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat','lorry']
mydictionary = {
    'colour': ['red', 'yellow', 'green'],
    'animal': ['rat','cat'], 
    'vehicle': ['car', 'jeep', 'lorry']
}
a = np.array([[0,0.7,0,0,0,0.3,0,0.6], [0.2,0,0,0,0,0,0.8,0]])

Option 1
Simple!

mapping = {v: k for k, l in mydictionary.items() for v in l}

pd.DataFrame(a, columns=mylist).rename(columns=mapping).stack() \
    .compress(lambda x: x > 0).groupby(level=[0, 1]).mean().unstack(fill_value=0)

   animal  colour  vehicle
0     0.0     0.7     0.45
1     0.5     0.0     0.00
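
Series.compress has since been deprecated and appears to have been removed in pandas 1.0, so the chain above may not run on current versions. One possible equivalent, sketched here with melt and a boolean filter rather than the original stack/compress chain (the names `tidy` and `means` are illustrative):

```python
import numpy as np
import pandas as pd

mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat', 'lorry']
mydictionary = {
    'colour': ['red', 'yellow', 'green'],
    'animal': ['rat', 'cat'],
    'vehicle': ['car', 'jeep', 'lorry'],
}
a = np.array([[0, 0.7, 0, 0, 0, 0.3, 0, 0.6], [0.2, 0, 0, 0, 0, 0, 0.8, 0]])
mapping = {v: k for k, l in mydictionary.items() for v in l}

# Long format: one row per (essay, word) pair with its score.
tidy = (pd.DataFrame(a, columns=mylist)
        .rename_axis('essay')
        .reset_index()
        .melt(id_vars='essay', var_name='word', value_name='score'))
tidy['category'] = tidy['word'].map(mapping)

# Keep positive scores only, average per essay and category,
# then pivot categories into columns (absent combinations become 0).
means = (tidy[tidy['score'] > 0]
         .groupby(['essay', 'category'])['score'].mean()
         .unstack(fill_value=0))
print(means)
```

Melting first avoids having duplicate column labels, at the cost of a slightly longer chain.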

Option 2
A harder-to-read solution, but it should be fast.

mapping = {v: k for k, l in mydictionary.items() for v in l}
f, u = pd.factorize([mapping[i] for i in mylist])
r = np.arange(a.shape[0]).repeat(a.shape[1])
c = np.tile(f, a.shape[0])
b = c + r * u.size

counts = np.bincount(b, a.ravel() > 0)
sums = np.bincount(b, a.ravel())
means = sums / np.where(counts > 0, counts, 1) * (counts > 0)

pd.DataFrame(means.reshape(-1, u.size), columns=u)

   animal  colour  vehicle
0     0.0     0.7     0.45
1     0.5     0.0     0.00
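
The key step in Option 2 is that b gives every (essay, category) pair its own integer bin, so a single np.bincount call can accumulate the totals for all pairs at once. A small sketch of that idea on toy data (the arrays and names here are made up for illustration):

```python
import numpy as np

# Two rows, three columns; columns 0 and 2 share group 0, column 1 is group 1.
values = np.array([[1.0, 2.0, 3.0],
                   [0.0, 4.0, 0.0]])
group_of_column = np.array([0, 1, 0])
n_groups = 2

rows = np.arange(values.shape[0]).repeat(values.shape[1])  # [0 0 0 1 1 1]
cols = np.tile(group_of_column, values.shape[0])           # [0 1 0 0 1 0]
bins = cols + rows * n_groups                              # [0 1 0 2 3 2]

# A weighted bincount sums the values that fall into each bin.
sums = np.bincount(bins, weights=values.ravel(),
                   minlength=values.shape[0] * n_groups)
print(sums.reshape(-1, n_groups))
# [[4. 2.]
#  [0. 4.]]
```

Counting the non-zero entries per bin the same way and dividing the two gives the per-category means.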