Question

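I have a DataFrame df of transactions in which the values in column Col can repeat. I use a Counter, dictionary1, to count the frequency of each Col value, and then I want to run a for loop over subsets of the data and compute a value pit. I want to create a new dictionary, dict1, whose keys are the keys of dictionary1 and whose values are the pit values. This is my code so far: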

dictionary1 = Counter(df['Col'])
dict1 = defaultdict(int)

for i in range(len(dictionary1)):       
    temp = df[df['Col'] == dictionary1.keys()[i]]
    b = temp['IsBuy'].sum()
    n = temp['IsBuy'].count()
    pit = b/n
    dict1[dictionary1.keys()[i]] = pit

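My question is: how do I assign keys and values to dict1, using the keys of dictionary1 and the values obtained from the pit calculation? In other words, what is the correct way to write the last line of the script above?

Thanks.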

Answer 1

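Since you are using pandas, I should point out that the problem you are facing is common enough that there is a built-in way to handle it. Collecting "similar" data into groups and then performing operations on them is called a groupby operation. It is probably worthwhile to read the tutorial section on the groupby split-apply-combine idiom; there are lots of neat things you can do!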
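As a rough illustration of split-apply-combine, the sketch below groups a small made-up DataFrame by Col and aggregates IsBuy three ways. The data and the names summary and ratio are assumptions for illustration only; the point is that the mean of a boolean column is the same quantity as the sum divided by the count computed in the question.

```python
import pandas as pd

# Hypothetical toy data, not the asker's real transactions.
df = pd.DataFrame({
    "Col": ["a", "a", "b", "b", "b"],
    "IsBuy": [True, False, True, True, True],
})

# Split by Col, apply three aggregations to IsBuy, combine into one table.
summary = df.groupby("Col")["IsBuy"].agg(["sum", "count", "mean"])

# For a boolean column, sum / count is the fraction of True values,
# which is what mean reports directly.
ratio = summary["sum"] / summary["count"]
```

Here ratio and summary["mean"] should agree for every group, which is why a single mean() call can stand in for the manual sum and count.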

The idiomatic pandas way to compute the pit values would be something like

df.groupby("Col")["IsBuy"].mean()

For example:

>>> import numpy as np
>>> import pandas as pd
>>> # make dummy data
>>> N = 10**4
>>> df = pd.DataFrame({"Col": np.random.randint(1, 10, N), "IsBuy": np.random.choice([True, False], N)})
>>> df.head()
   Col  IsBuy
0    3  False
1    6   True
2    6   True
3    1   True
4    5   True
>>> df.groupby("Col")["IsBuy"].mean()
Col
1      0.511709
2      0.495697
3      0.489796
4      0.510658
5      0.507491
6      0.513183
7      0.522936
8      0.488688
9      0.490498
Name: IsBuy, dtype: float64

If you insist, you can turn the resulting Series into a dictionary:

>>> df.groupby("Col")["IsBuy"].mean().to_dict()
{1: 0.51170858629661753, 2: 0.49569707401032703, 3: 0.48979591836734693, 4: 0.51065801668211308, 5: 0.50749063670411987, 6: 0.51318267419962338, 7: 0.52293577981651373, 8: 0.48868778280542985, 9: 0.49049773755656106}
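If you would rather keep the explicit loop from the question, a minimal sketch under Python 3 might look like the following. The sample data is made up, and the variable grouped is introduced here only for comparison. The main change is iterating over the Counter's keys directly, since dictionary1.keys()[i] fails in Python 3, where dict views cannot be indexed.

```python
from collections import Counter

import pandas as pd

# Small hypothetical dataset standing in for the asker's df.
df = pd.DataFrame({
    "Col": [1, 1, 2, 2, 2, 3],
    "IsBuy": [True, False, True, True, False, False],
})

dictionary1 = Counter(df["Col"])

# Loop over the keys themselves instead of indexing into keys().
dict1 = {}
for key in dictionary1:
    temp = df[df["Col"] == key]
    # Fraction of rows in this subset where IsBuy is True.
    dict1[key] = temp["IsBuy"].sum() / temp["IsBuy"].count()

# The groupby version should produce the same mapping.
grouped = df.groupby("Col")["IsBuy"].mean().to_dict()
```

The loop filters the whole DataFrame once per key, so on large data the groupby form is likely to be noticeably faster as well as shorter.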
