Weighted average for every possible range

Time: 2017-03-27 14:04:05

Tags: pandas numpy

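I am looking for the average of the column 'Score', weighted by the column 'Weight', for all sub-ranges of rows: 0-1, 0-2, ..., 1-2, 1-3, ..., 2-3, 2-4, ... and so on.

The expected result is the sub-range with the highest average.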

import pandas as pd

df2 = pd.DataFrame(
    {'Weight': (2, 3, 4, 5, 2, 3, 4, 5),
     'Score': (6, 7, 8, 9, 6, 7, 8, 9)})

print(df2)

   Score  Weight
0      6       2
1      7       3
2      8       4
3      9       5
4      6       2
5      7       3
6      8       4
7      9       5

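1 answer:

Answer 0 (score: 2)

You can use list comprehensions or generator expressions here (preferably the latter).

  • First, generate all possible ranges with two loops that define the start and the end of each range.
  • Second, compute an average for each range using the generated indices.
  • Finally, pick the range with the highest average.

See below: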

# create column with weighted scores
df2["Weighted"] = df2["Score"] * df2["Weight"]

# create helper function for averaging
average = lambda indices: df2.loc[indices, "Weighted"].mean()

# generate all possible ranges
length = df2.shape[0] + 1 
ranges = (range(start, end)
          for start in range(length) 
          for end in range(start + 1, length))

# generate all averages
averages = ((indices, average(indices)) for indices in ranges)

# get highest average with value
high_range, high_value = max(averages, key=lambda x: x[1])

# show result
print("Range:", list(high_range), "Avg:", high_value)
Range: [3] Avg: 45.0
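Note that the `Weighted` column above averages the products Score * Weight, so a row scores higher simply by carrying a larger weight. If the conventional weighted mean is what is wanted, that is sum(Score * Weight) / sum(Weight), a minimal sketch could look like the following. It assumes `numpy.average` with its `weights` argument and positional slicing with `.iloc`; the name `weighted_average` is made up for illustration and is not part of the answer above.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {'Weight': (2, 3, 4, 5, 2, 3, 4, 5),
     'Score': (6, 7, 8, 9, 6, 7, 8, 9)})

def weighted_average(start, end):
    """Weight-normalised mean of Score over rows start..end-1 (hypothetical helper)."""
    rows = df2.iloc[start:end]
    return np.average(rows["Score"], weights=rows["Weight"])

n = len(df2)
# one (range bounds, weighted mean) pair per half-open range [start, end)
candidates = [((start, end), weighted_average(start, end))
              for start in range(n)
              for end in range(start + 1, n + 1)]

# max() returns the first maximum it encounters
best_range, best_value = max(candidates, key=lambda x: x[1])
print("Range:", best_range, "Weighted mean:", best_value)
```

With this definition the result always lies between the smallest and largest Score in the range, so single-row ranges can only tie with, never lose to, longer ranges that contain them.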

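Note that your data frame needs a sorted integer index starting at 0. Otherwise this solution will not work correctly, because it uses range to exploit the structure of the index.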
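If the index is not a sorted integer index starting at 0, one way around the restriction is to select rows by position rather than by label. The sketch below assumes `.iloc` slicing; the frame `df3`, its string index, and the helper `average_by_position` are hypothetical and only serve as an illustration. Calling `df2.reset_index(drop=True)` before running the original code would be another option.

```python
import pandas as pd

# hypothetical frame with a non-default index, for illustration only
df3 = pd.DataFrame(
    {'Weight': (2, 3, 4, 5),
     'Score': (6, 7, 8, 9)},
    index=['a', 'b', 'c', 'd'])
df3["Weighted"] = df3["Score"] * df3["Weight"]

def average_by_position(start, end):
    """Mean of Weighted over the rows at positions start..end-1."""
    # .iloc selects by position, so the index labels do not matter
    return df3["Weighted"].iloc[start:end].mean()

print(average_by_position(2, 4))  # rows labelled 'c' and 'd'
```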

To explain this in more detail, take a closer look at the generated ranges:

ranges = (range(start, end)
          for start in range(length) 
          for end in range(start + 1, length))
print([list(x) for x in ranges])

[[0],
 [0, 1],
 [0, 1, 2],
 [0, 1, 2, 3],
 [0, 1, 2, 3, 4],
 [0, 1, 2, 3, 4, 5],
 [0, 1, 2, 3, 4, 5, 6],
 [0, 1, 2, 3, 4, 5, 6, 7],
 [1],
 [1, 2],
 [1, 2, 3],
 [1, 2, 3, 4],
 [1, 2, 3, 4, 5],
 [1, 2, 3, 4, 5, 6],
 [1, 2, 3, 4, 5, 6, 7],
 [2],
 [2, 3],
 [2, 3, 4],
 [2, 3, 4, 5],
 [2, 3, 4, 5, 6],
 [2, 3, 4, 5, 6, 7],
 [3],
 [3, 4],
 [3, 4, 5],
 [3, 4, 5, 6],
 [3, 4, 5, 6, 7],
 [4],
 [4, 5],
 [4, 5, 6],
 [4, 5, 6, 7],
 [5],
 [5, 6],
 [5, 6, 7],
 [6],
 [6, 7],
 [7]]
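For n rows this produces n * (n + 1) / 2 ranges, 36 in this case. As an aside, the same ranges can probably be generated with `itertools.combinations`, since every range corresponds to one pair start < end drawn from 0..n. This is only a sketch of an alternative, not what the code above does:

```python
from itertools import combinations

n = 8  # number of rows in df2

# each pair (start, end) with start < end defines the half-open range [start, end)
ranges = [range(start, end) for start, end in combinations(range(n + 1), 2)]

print(len(ranges))
print([list(r) for r in ranges[:3]])
```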

And at the averages:

ranges = (range(start, end)
          for start in range(length) 
          for end in range(start + 1, length))
averages = ((indices, average(indices)) for indices in ranges)
print([list(x) for x in averages])

[[range(0, 1), 12.0],
 [range(0, 2), 16.5],
 [range(0, 3), 21.666666666666668],
 [range(0, 4), 27.5],
 [range(0, 5), 24.399999999999999],
 [range(0, 6), 23.833333333333332],
 [range(0, 7), 25.0],
 [range(0, 8), 27.5],
 [range(1, 2), 21.0],
 [range(1, 3), 26.5],
 [range(1, 4), 32.666666666666664],
 [range(1, 5), 27.5],
 [range(1, 6), 26.199999999999999],
 [range(1, 7), 27.166666666666668],
 [range(1, 8), 29.714285714285715],
 [range(2, 3), 32.0],
 [range(2, 4), 38.5],
 [range(2, 5), 29.666666666666668],
 [range(2, 6), 27.5],
 [range(2, 7), 28.399999999999999],
 [range(2, 8), 31.166666666666668],
 [range(3, 4), 45.0],
 [range(3, 5), 28.5],
 [range(3, 6), 26.0],
 [range(3, 7), 27.5],
 [range(3, 8), 31.0],
 [range(4, 5), 12.0],
 [range(4, 6), 16.5],
 [range(4, 7), 21.666666666666668],
 [range(4, 8), 27.5],
 [range(5, 6), 21.0],
 [range(5, 7), 26.5],
 [range(5, 8), 32.666666666666664],
 [range(6, 7), 32.0],
 [range(6, 8), 38.5],
 [range(7, 8), 45.0]]
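Each of these averages re-reads the rows of its range from the data frame. For longer frames, a prefix-sum array lets every range mean be computed from two lookups. The following is a minimal sketch assuming `numpy.cumsum`; the `weighted` values are copied from the `Weighted` column above, and `range_mean` is a hypothetical helper name.

```python
import numpy as np

# values of df2["Weighted"] (Score * Weight), copied here to keep the sketch self-contained
weighted = np.array([12, 21, 32, 45, 12, 21, 32, 45], dtype=float)
n = len(weighted)

# prefix[i] holds the sum of the first i values, so prefix[0] == 0
prefix = np.concatenate(([0.0], np.cumsum(weighted)))

def range_mean(start, end):
    """Mean of weighted[start:end] computed from two prefix sums."""
    return (prefix[end] - prefix[start]) / (end - start)

# first range with the highest mean, as with max() in the original code
best = max(
    ((start, end) for start in range(n) for end in range(start + 1, n + 1)),
    key=lambda r: range_mean(*r))
print(best, range_mean(*best))
```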

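Edit: Multiple maximum ranges

To get all of the maximum ranges (not just one), you need to modify the code slightly. Because we have to iterate over averages twice (first to find the maximum average, then to compare every average against it), I converted it to a list comprehension.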

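# re-create the ranges; a generator can only be consumed once
ranges = (range(start, end)
          for start in range(length)
          for end in range(start + 1, length))

# generate all averages
averages = [(indices, df2.loc[indices, "Weighted"].mean())
            for indices in ranges]

# find the maximum average, then keep every range that reaches it
max_average = max(averages, key=lambda x: x[1])[1]
highest = [tuples for tuples in averages if tuples[1] == max_average]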

print(highest)
[(range(3, 4), 45.0), (range(7, 8), 45.0)]
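The `==` comparison works here because both maxima are exactly 45.0. With averages that are not exactly representable as floats, two ranges that tie mathematically might differ in the last digits. A hedged variant, assuming `numpy.isclose` with its default tolerances is acceptable for the data, could look like this; the short `averages` list is a hand-picked excerpt of the values above.

```python
import numpy as np

# small excerpt of (range, average) pairs from the output above
averages = [(range(2, 4), 38.5), (range(3, 4), 45.0), (range(7, 8), 45.0)]

max_average = max(value for _, value in averages)

# keep every range whose average equals the maximum within floating-point tolerance
highest = [(indices, value) for indices, value in averages
           if np.isclose(value, max_average)]
print(highest)
```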