I have a list of words as follows.
mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat','lorry']
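I also have a list of lists, one inner list per essay in my dataset, holding a value for each word in 'mylist', as in the example below (i.e. if a 'mylist' word occurs in the essay, it gets a value between 0 and 1).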
[[0,0.7,0,0,0,0.3,0,0.6], [0.2,0,0,0,0,0,0.8,0]]
In other words,
[0,0.7,0,0,0,0.3,0,0.6] says that this only has values 'yellow', 'jeep', 'lorry'
Now I have a dictionary of categories as follows.
mydictionary = {'colour': ['red', 'yellow', 'green'], 'animal': ['rat','cat'],
'vehicle': ['car', 'jeep']}
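Now, using the keys of 'mydictionary', I want to transform the list of lists as follows (i.e. if one or more 'mylist' words of a category have a non-zero value, I score that key with the average of those values).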
[[0.7, 0, 0.45], [0, 0.5, 0]]
In other words,
[0.7, 0, 0.45] says that:
0.7 - average value for elements in 'colour' = 0.7/1 = 0.7
0 - no elements in 'animal'
0.45 - average value for elements in 'vehicle' = (0.3+0.6)/2 = 0.45
So my output should be a list of lists as described above -> [[0.7,0,0.45],[0,0.5,0]]
I would like to know whether this can be done with a pandas DataFrame.
Answer 0 (score: 3)
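You should reconsider your data structure. One problem you will run into is that a dict (before Python 3.7) does not guarantee any ordering. First, deal with the order by putting the values in an ordered container (a list works fine):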
>>> vals = [mydictionary['colour'], mydictionary['animal'], mydictionary['vehicle']]
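If insertion order can be relied on, the list does not have to be spelled out by hand. A minimal sketch, assuming Python 3.7 or later (where dict preserves insertion order) and that mydictionary is written in the order colour, animal, vehicle as in the question; the name keys is only illustrative:

```python
# Assumes Python 3.7+, where dict keeps insertion order.
mydictionary = {'colour': ['red', 'yellow', 'green'], 'animal': ['rat', 'cat'],
                'vehicle': ['car', 'jeep']}

keys = list(mydictionary)           # category names, in definition order
vals = list(mydictionary.values())  # word lists, aligned with keys
```

On older interpreters the explicit list above remains the safe choice.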
Now the essays:
>>> essays = [[0,0.7,0,0,0,0.3,0,0.6], [0.2,0,0,0,0,0,0.8,0]]
Then a simple loop that builds a map from mylist to each essay's weights, using the mean function from the statistics package:
>>> import statistics as stats
>>> result = []
>>> for essay in essays:
...     map = dict(zip(mylist, essay))
...     result.append([stats.mean(map[e] for e in v) for v in vals])
...
>>> result
[[0.2333333333333333, 0, 0.15], [0, 0.5, 0]]
Honestly, I'm not sure pandas is the best tool for this, but I suppose you could use a DataFrame like this:
>>> import pandas as pd
>>> df = pd.DataFrame({'essay{}'.format(i):essay for i, essay in enumerate(essays)}, index=mylist)
>>> df
        essay0  essay1
cat        0.0     0.2
yellow     0.7     0.0
car        0.0     0.0
red        0.0     0.0
green      0.0     0.0
jeep       0.3     0.0
rat        0.0     0.8
lorry      0.6     0.0
Then build a grouper mapping from each word to its category:
>>> grouper = {v: k for k, vv in mydictionary.items() for v in vv}
Then use pd.DataFrame.groupby:
>>> df.groupby(grouper).mean()
           essay0  essay1
animal   0.000000     0.5
colour   0.233333     0.0
vehicle  0.150000     0.0
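The gap between these numbers and the 0.7 the question asks for comes from the zeros: the mean divides by every word in the category, not only by the words that actually occur. A small check for the colour category of the first essay (the variable names are illustrative, not from the answer):

```python
from statistics import mean

# Weights of red, yellow, green in the first essay
colour_weights = [0, 0.7, 0]

mean_all = mean(colour_weights)                        # divides by 3 words
mean_present = mean([w for w in colour_weights if w])  # divides by 1 word
```

Here mean_all is roughly 0.2333 while mean_present is 0.7, which is the value the question expects.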
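Following the comments, the fix is quite simple: materialize the weights into a list, filtering out the zeros, like this: [map[e] for e in v if map[e]], and then take the mean of that list. However, you have to make sure the list is not empty. Just define a helper function that checks for this and otherwise returns a default value of 0: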
>>> def mean_default(seq):
...     if seq:
...         return stats.mean(seq)
...     else:
...         return 0
...
Then simply:
>>> result = []
>>> for essay in essays:
...     map = dict(zip(mylist, essay))
...     result.append([mean_default([map[e] for e in v if map[e]]) for v in vals])
...
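The pieces above can be folded into one self-contained function. This is only a sketch: category_means is a hypothetical name, and it assumes Python 3.7+ dict ordering and that 'lorry' has been added to the 'vehicle' category, which the expected value 0.45 in the question implies.

```python
from statistics import mean


def category_means(words, essays, categories, default=0):
    """Average the non-zero weights of each category, essay by essay.

    category_means is a hypothetical helper name used for illustration.
    """
    rows = []
    for essay in essays:
        weights = dict(zip(words, essay))  # word -> weight in this essay
        row = []
        for members in categories.values():
            present = [weights[w] for w in members if weights[w]]
            # An empty list means no word of the category occurs in the essay
            row.append(mean(present) if present else default)
        rows.append(row)
    return rows


mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat', 'lorry']
essays = [[0, 0.7, 0, 0, 0, 0.3, 0, 0.6], [0.2, 0, 0, 0, 0, 0, 0.8, 0]]
# Assumption: 'lorry' is added to 'vehicle' (not in the question's dictionary).
mydictionary = {'colour': ['red', 'yellow', 'green'], 'animal': ['rat', 'cat'],
                'vehicle': ['car', 'jeep', 'lorry']}

result = category_means(mylist, essays, mydictionary)
```

Up to floating-point rounding, result matches the [[0.7, 0, 0.45], [0, 0.5, 0]] requested in the question.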
For pandas, as @IanS shows, simply replace 0 with np.nan.
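Applied to the DataFrame from this answer, that could look like the following. It is a sketch rather than a definitive version: it assumes 'lorry' has been added to 'vehicle' (so that the 0.45 from the question is reachable) and relies on groupby means skipping NaN values by default.

```python
import numpy as np
import pandas as pd

mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat', 'lorry']
essays = [[0, 0.7, 0, 0, 0, 0.3, 0, 0.6], [0.2, 0, 0, 0, 0, 0, 0.8, 0]]
# Assumption: 'lorry' is added to 'vehicle' (not in the question's dictionary).
mydictionary = {'colour': ['red', 'yellow', 'green'], 'animal': ['rat', 'cat'],
                'vehicle': ['car', 'jeep', 'lorry']}

df = pd.DataFrame({'essay{}'.format(i): essay for i, essay in enumerate(essays)},
                  index=mylist)
grouper = {v: k for k, vv in mydictionary.items() for v in vv}

# Zeros become NaN, so the mean ignores them; all-NaN groups become 0 again
means = df.replace(0, np.nan).groupby(grouper).mean().fillna(0)
```

The rows of means are the categories and the columns are the essays.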
Answer 1 (score: 1)
First, invert the keys and values of the dictionary:
{v: k for k, l in mydictionary.items() for v in l}
This returns:
{'car': 'vehicle',
 'cat': 'animal',
 'green': 'colour',
 'jeep': 'vehicle',
 'rat': 'animal',
 'red': 'colour',
 'yellow': 'colour'}
Second, map it to get the category of each word:
import numpy as np
import pandas as pd

df = pd.DataFrame(mylist, columns=['word'])
df['category'] = df['word'].map({v: k for k, l in mydictionary.items() for v in l})
Output:
# note: I have added lorry to the dictionary
     word category
0     cat   animal
1  yellow   colour
2     car  vehicle
3     red   colour
4   green   colour
5    jeep  vehicle
6     rat   animal
7   lorry  vehicle
Third, map it onto your list of lists by concatenating:
df = pd.concat([
    df,
    pd.DataFrame([[0,0.7,0,0,0,0.3,0,0.6], [0.2,0,0,0,0,0,0.8,0]]).T
], axis=1)
Fourth, group by category:
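# numeric_only=True skips the string column 'word' (pandas 2.x raises a TypeError otherwise)
df.groupby('category').mean(numeric_only=True)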
Output:
                 0    1
category
animal    0.000000  0.5
colour    0.233333  0.0
vehicle   0.300000  0.0
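Edit: to ignore the 0 values, replace them with NaN.
# numeric_only=True skips the string column 'word' (pandas 2.x raises a TypeError otherwise)
df.replace({0: np.nan}).groupby('category').mean(numeric_only=True)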
Output:
             0    1
category
animal     NaN  0.5
colour    0.70  NaN
vehicle   0.45  NaN
If needed, you can follow this with fillna(0).
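To get from that grouped frame back to the list of lists the question asks for, one possibility is to fill the gaps, put the categories in the asker's order, and transpose. In this sketch the grouped frame is rebuilt by hand from the output shown above so that the snippet stands alone; the names grouped and order are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the grouped output shown above (categories x essays)
grouped = pd.DataFrame(
    {0: [np.nan, 0.70, 0.45], 1: [0.5, np.nan, np.nan]},
    index=pd.Index(['animal', 'colour', 'vehicle'], name='category'),
)

order = ['colour', 'animal', 'vehicle']  # category order used in the question
result = grouped.fillna(0).reindex(order).T.to_numpy().tolist()
# result == [[0.7, 0.0, 0.45], [0.0, 0.5, 0.0]]
```

Each inner list is one essay, with the categories in the order given by order.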
Answer 2 (score: 1)
Setup
import numpy as np
import pandas as pd

mylist = ['cat', 'yellow', 'car', 'red', 'green', 'jeep', 'rat','lorry']
mydictionary = {
    'colour': ['red', 'yellow', 'green'],
    'animal': ['rat','cat'],
    'vehicle': ['car', 'jeep', 'lorry']
}
a = np.array([[0,0.7,0,0,0,0.3,0,0.6], [0.2,0,0,0,0,0,0.8,0]])
Option 1
Simple!
mapping = {v: k for k, l in mydictionary.items() for v in l}
# Series.compress was removed in pandas 1.0; .loc with a callable does the same filtering
pd.DataFrame(a, columns=mylist).rename(columns=mapping).stack() \
    .loc[lambda x: x > 0].groupby(level=[0, 1]).mean().unstack(fill_value=0)
   animal  colour  vehicle
0     0.0     0.7     0.45
1     0.5     0.0     0.00
Option 2
A harder-to-read solution, but it should be fast.
mapping = {v: k for k, l in mydictionary.items() for v in l}
f, u = pd.factorize([mapping[i] for i in mylist])
r = np.arange(a.shape[0]).repeat(a.shape[1])
c = np.tile(f, a.shape[0])
b = c + r * u.size
counts = np.bincount(b, a.ravel() > 0)
sums = np.bincount(b, a.ravel())
means = sums / np.where(counts > 0, counts, 1) * (counts > 0)
pd.DataFrame(means.reshape(-1, u.size), columns=u)
   animal  colour  vehicle
0     0.0     0.7     0.45
1     0.5     0.0     0.00
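The core of Option 2 is np.bincount with a weights argument: it sums the weights per integer label, and with a 0/1 mask as weights it counts the positive entries per label. A toy sketch of that idea on made-up labels and values (groups and values are invented for illustration, and np.divide stands in for the np.where expression used above):

```python
import numpy as np

# Made-up data: one integer group label per value
groups = np.array([0, 1, 1, 0, 2, 3])
values = np.array([0.0, 0.7, 0.3, 0.2, 0.6, 0.0])

sums = np.bincount(groups, weights=values)                       # per-group totals
counts = np.bincount(groups, weights=(values > 0).astype(float)) # positives per group

# Divide only where a group has positive entries; other groups stay at 0
means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
```

In Option 2 the label b = c + r * u.size plays the role of groups, giving every (essay, category) pair its own bin before the result is reshaped into one row per essay.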