Question

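I am trying to speed up a piece of Python code.

Given two (numpy) arrays of equal size, the goal is to find, for each distinct value in one array (say a), the average of the values at the same indices in the other array (say b). The indices of the two arrays are in sync.

For example: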

a = np.array([1, 1, 1, 2, 2, 2])
b = np.array([10, 10, 10, 20, 20, 20])

There are two distinct values in a: 1 and 2. The values in b at the indices where a holds 1 are [10, 10, 10], so average(1) is 10. Similarly, average(2) is 20.

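We can assume that the set of distinct values in a is known a priori. The values in a need not be consecutive, and their order is random. I chose the example above only to keep the description simple.

Here is how I approached it: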

# Accumulate the total sum and count
for index, val in np.ndenumerate(a):
    val_to_sum[val] += b[index]
    val_to_count[val] += 1

# Calculate the mean
for val in val_to_sum.keys():
    if val_to_count[val]:  # skip vals with zero count
        val_to_mean[val] = val_to_sum[val] / val_to_count[val]

Here val_to_sum and val_to_count are dictionaries initialized to zero from the known list of values that can appear in a (1 and 2 in this case).
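For reference, a self-contained version of the snippet above might look like the following. This is only a sketch: the name `known_values` and the way the dictionaries are initialized are assumptions, since the question describes the initialization in words but does not show it.

```python
import numpy as np

a = np.array([1, 1, 1, 2, 2, 2])
b = np.array([10, 10, 10, 20, 20, 20])

# Assumption: the distinct values of `a` are known up front (hypothetical name).
known_values = [1, 2]

val_to_sum = {v: 0.0 for v in known_values}
val_to_count = {v: 0 for v in known_values}
val_to_mean = {}

# Accumulate the total sum and count
for index, val in np.ndenumerate(a):
    val_to_sum[val] += b[index]
    val_to_count[val] += 1

# Calculate the mean, skipping values that never occurred
for val in val_to_sum:
    if val_to_count[val]:
        val_to_mean[val] = val_to_sum[val] / val_to_count[val]

print(val_to_mean)
```

The Python-level loop touches every element individually, which is the part that becomes slow for arrays with millions of entries.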

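I doubt this is the fastest way to compute it. I expect the arrays to be very long, say a few million elements, with the set of possible values numbering in the tens.

How can I speed up this computation?

Could this be the solution? Inspired by one of the answers below, this might do it: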

Answer 1

Maybe something like this would work:

import numpy as np

a = np.array([1, 1, 1, 2, 2, 2])
b = np.array([10, 10, 10, 20, 20, 20])

np.average(b[a==1])
np.average(b[a==2])

For a larger dataset:

import numpy as np

a = np.random.randint(1,30,1000000)
b = np.random.random(size=1000000)

for x in set(a):
  print("Average for values marked {0}: {1}".format(x,np.average(b[a==x])))
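The loop above builds one boolean mask per distinct value, so it scans a once for every value. As a possible alternative (not part of this answer), the sums and counts can be gathered in one vectorized step with `np.unique(..., return_inverse=True)` and `np.bincount`. The helper name `grouped_means` is hypothetical, and this is a minimal sketch assuming 1-D arrays of equal length.

```python
import numpy as np

def grouped_means(a, b):
    """Return {value in a: mean of b at the matching indices} (hypothetical helper)."""
    # values: sorted distinct labels; inverse[i]: position of a[i] within values
    values, inverse = np.unique(a, return_inverse=True)
    inverse = inverse.ravel()
    # Per-label sum of b and per-label element count
    sums = np.bincount(inverse, weights=b, minlength=len(values))
    counts = np.bincount(inverse, minlength=len(values))
    # Every label in values occurs at least once, so counts is never zero
    return dict(zip(values.tolist(), (sums / counts).tolist()))

a = np.array([1, 1, 1, 2, 2, 2])
b = np.array([10, 10, 10, 20, 20, 20])
print(grouped_means(a, b))
```

Because the labels are remapped to 0..k-1 first, the values in a do not need to be consecutive or small.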

Answer 2

You can go through the arrays in a single pass:

means_dict = {}
for i in range(len(a)):
    val = a[i]
    n = b[i]
    if val not in means_dict.keys():
        means_dict[val] = np.array([0.0,0.0])
    arr = means_dict[val]
    arr[0] = arr[0] * (arr[1] / (arr[1] + 1)) + n * (1 / (arr[1] + 1))
    arr[1] = arr[1] + 1

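This computes a running mean for each value. At the end you get a dict holding the mean and the count for each value.

Edit:
Actually, playing around with it shows that this gives the best results: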
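The update in the loop above is the incremental mean: with n values seen so far, the new mean is old_mean * n / (n + 1) + x / (n + 1), which can also be written as old_mean + (x - old_mean) / (n + 1). A small sketch of that equivalent form follows; the helper name `running_means` and the sample data are made up for illustration.

```python
import numpy as np

def running_means(a, b):
    """Single pass; returns {value: (mean, count)} (hypothetical helper)."""
    stats = {}
    for val, x in zip(a.tolist(), b.tolist()):
        mean, count = stats.get(val, (0.0, 0))
        count += 1
        # Incremental update: move the mean toward x by 1/count of the gap
        mean += (x - mean) / count
        stats[val] = (mean, count)
    return stats

a = np.array([1, 2, 1, 2, 1])
b = np.array([10.0, 20.0, 14.0, 30.0, 12.0])
stats = running_means(a, b)
print(stats)
```

For each value, the running mean should agree with `np.mean(b[a == value])` up to floating-point rounding.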

def f3(a,b):
    means = {}
    for val in set(a):
        means[val] = np.average(b[a == val])
    return means

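This is similar to what you suggested, except that it only iterates over the set of distinct values, which saves a lot of time.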
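One way to check a timing claim like this on your own machine is a small `timeit` comparison. The sketch below is not the measurement behind this answer: `f2` is a stand-in single-pass Python loop (sum and count per value), `f3` mirrors the function above, and the array size and seed are arbitrary assumptions.

```python
import timeit
import numpy as np

def f2(a, b):
    # Single pass in pure Python: accumulate sum and count per value.
    sums, counts = {}, {}
    for val, x in zip(a.tolist(), b.tolist()):
        sums[val] = sums.get(val, 0.0) + x
        counts[val] = counts.get(val, 0) + 1
    return {val: sums[val] / counts[val] for val in sums}

def f3(a, b):
    # One boolean mask per distinct value, as in the function above;
    # keys and values are converted to plain Python types only for comparison.
    return {int(val): float(np.average(b[a == val])) for val in set(a)}

# Assumed test data: size and seed are arbitrary choices for illustration.
rng = np.random.default_rng(0)
a = rng.integers(1, 30, size=200_000)
b = rng.random(200_000)

res2 = f2(a, b)
res3 = f3(a, b)

for name, fn in (("f2", f2), ("f3", f3)):
    seconds = timeit.timeit(lambda: fn(a, b), number=3) / 3
    print(f"{name}: {seconds:.4f} s per call")
```

Timings depend on the machine, the array length, and the number of distinct values, so the printed numbers are only indicative.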

Answer 3

This can be done by removing duplicates, so try the following:

from collections import OrderedDict
import numpy as np
a = np.array([1, 1, 1, 2, 2, 2])
b = np.array([10, 10, 10, 20, 20, 20])

a=list(OrderedDict.fromkeys(a))
b=list(OrderedDict.fromkeys(b))  
print(b)

If the elements in b differ, use:

import numpy as np
a = np.array([1, 1, 1, 2, 2, 2])
b = np.array([10, 10, 10, 20, 20, 20])   
d = {}

for l, n in zip(a, b):
    d.setdefault(l, []).append(n)

for key in d:
    print(key, sum(d[key]) / len(d[key]))

Code: https://onlinegdb.com/BJih3DplE

Average calculation in Python