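I have a piece of code that takes a large data set and converts it into a smaller data set with the same proportions. Let me explain:
Say you have 20 blue marbles and 10 red marbles. If I wanted to represent this data with 3 marbles, I would use 2 blue marbles and 1 red marble.
I don't mind if the result is not exact, for example when representing 17 blue and 16 red with 4 marbles. The closest proportional representation is 2 blue and 2 red, and that is fine.
Here is my code in Python: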
from random import randrange
data_set = [randrange(100, 1000) for x in range(5)]
required_amount = 20
special_number = required_amount / sum(data_set)
proportional_data_set = [round(x * special_number) for x in data_set]
print(data_set)
print(required_amount)
print(proportional_data_set)
print(sum(proportional_data_set))
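The problem is that I ask for a sample of 20, but the proportional data set sometimes totals 21 or 19. I assume this is due to rounding error, but does anyone know a better way to solve this?
Sample output from a run that works correctly: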
[832, 325, 415, 385, 745]
20
[6, 2, 3, 3, 6]
20
Sample output from a run that does not work correctly:
[414, 918, 860, 978, 438]
20
[2, 5, 5, 5, 2]
19
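The failing run above can be reproduced deterministically. The sketch below wraps the question's rounding logic in a function (the name naive_round is only illustrative) and prints the exact shares before rounding. Four of the five shares are rounded down, losing about 1.23 in total, while the one that is rounded up gains only about 0.23, so the sum comes out one short.

```python
def naive_round(data_set, required_amount):
    """Scale each value independently and round it, as in the question's code."""
    scale = required_amount / sum(data_set)
    return [round(x * scale) for x in data_set]


failing = [414, 918, 860, 978, 438]
scale = 20 / sum(failing)

# Exact (unrounded) shares, shown to three decimals.
print([round(x * scale, 3) for x in failing])

result = naive_round(failing, 20)
print(result, "=", sum(result))
```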
If anyone knows of a similar method for doing something like this, that would be great too.
Answer 0 (score: 3)
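Here is one way to solve the problem. Compute special_number as the number of units of data_set that each "marble" represents. Then use divmod() to compute the proportional amounts and the remainders. Because divmod() returns an integer quotient, sum(proportional_data_set) will in most cases be less than required_amount.
Finally, use a loop to find the highest remainder and increment the corresponding entry of proportional_data_set, repeating until sum(proportional_data_set) == required_amount.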
from random import randrange
data_set = [randrange(100, 1000) for x in range(5)]
required_amount = 20
special_number = sum(data_set) // required_amount
print("Data set:")
print(data_set)
print("Special number:")
print(special_number)
# divmod() returns a pair of numbers, split them into quotients and remainders
pairs = [divmod(x, special_number) for x in data_set]
proportional_data_set = [x[0] for x in pairs]
remainder = [x[1] for x in pairs]
print()
print("Proportional data set before adjusting:")
print(proportional_data_set, "=", sum(proportional_data_set))
print("Remainders:")
print(remainder)
while sum(proportional_data_set) < required_amount:
    i = remainder.index(max(remainder))  # index of the highest remainder
    proportional_data_set[i] += 1  # add another marble to this index
    remainder[i] = -1  # don't use this remainder again
print()
print("Proportional data set after adjusting:")
print(proportional_data_set, "=", sum(proportional_data_set))
print("Remainders:")
print(remainder)
The output looks like this:
Data set:
[546, 895, 257, 226, 975]
Special number:
144
Proportional data set before adjusting:
[3, 6, 1, 1, 6] = 17
Remainders:
[114, 31, 113, 82, 111]
Proportional data set after adjusting:
[4, 6, 2, 1, 7] = 20
Remainders:
[-1, 31, -1, 82, -1]
Each highest remainder is used to increment the proportional data set and is then set to -1 so that it is not used again.
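The same largest-remainder idea can be packaged as a reusable function. The following is a minimal sketch rather than the answer's exact code: it is a variant that takes divmod(x * required_amount, total) instead of dividing by a floored special_number, so the integer parts can never overshoot required_amount. The function name largest_remainder is hypothetical, and the sketch assumes non-negative integer inputs with a positive total.

```python
def largest_remainder(data_set, required_amount):
    """Allocate required_amount units proportionally to data_set.

    Assumes non-negative integers and sum(data_set) > 0.
    The result always sums to required_amount.
    """
    total = sum(data_set)
    # Integer part and remainder of the exact share x * required_amount / total.
    pairs = [divmod(x * required_amount, total) for x in data_set]
    result = [quotient for quotient, _ in pairs]
    shortfall = required_amount - sum(result)
    # Indices ordered by remainder, largest first; ties go to the earlier index.
    by_remainder = sorted(range(len(pairs)), key=lambda i: pairs[i][1], reverse=True)
    for i in by_remainder[:shortfall]:
        result[i] += 1
    return result


print(largest_remainder([546, 895, 257, 226, 975], 20))
print(largest_remainder([414, 918, 860, 978, 438], 20))
```

For the data set shown in the output above this agrees with the answer's result, [4, 6, 2, 1, 7]. For the question's failing example it returns a list that sums to 20.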
Answer 1 (score: 2)
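I was going to offer a solution based on a Bresenham line drawn between the cumulative sum of the input data and the cumulative sum of the proportional output values, but (a) it turned out to give the wrong answer (see below), and (b) I believe @tzaman's pointer to "Allocate an array of integers proportionally compensating for rounding errors" provides a simpler solution than any correction I could make to the Bresenham approach (the proportional() function is by @Dr.Goulu):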
def proportional(nseats, votes):
    """assign n seats proportionally to votes using Hagenbach-Bischoff quota
    :param nseats: int number of seats to assign
    :param votes: iterable of int or float weighting each party
    :result: list of ints seats allocated to each party
    """
    quota = sum(votes) / (1. + nseats)  # force float
    frac = [vote / quota for vote in votes]
    res = [int(f) for f in frac]
    n = nseats - sum(res)  # number of seats remaining to allocate
    if n == 0: return res  # done
    if n < 0: return [min(x, nseats) for x in res]  # see siamii's comment
    # give the remaining seats to the n parties with the largest remainder
    remainders = [ai - bi for ai, bi in zip(frac, res)]
    limit = sorted(remainders, reverse=True)[n - 1]
    # n parties with remainder larger than limit get an extra seat
    for i, r in enumerate(remainders):
        if r >= limit:
            res[i] += 1
            n -= 1  # attempt to handle perfect equality
            if n == 0: return res  # done
    raise RuntimeError("unallocated seats remain")  # should never happen
print (proportional(20,[832, 325, 415, 385, 745]))
print (proportional(20,[414, 918, 860, 978, 438]))
...gives the output:
[6, 2, 3, 3, 6]
[2, 5, 5, 6, 2]
...as required.
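To see where the first result comes from, the intermediate quantities can be worked out by hand. The sketch below repeats the quota arithmetic of proportional() for the first example, using fractions.Fraction instead of floats so that the shares are exact; the variable names are only illustrative.

```python
from fractions import Fraction

votes = [832, 325, 415, 385, 745]
nseats = 20

# Hagenbach-Bischoff quota: total votes divided by (seats + 1).
quota = Fraction(sum(votes), nseats + 1)
shares = [Fraction(v) / quota for v in votes]

# Whole seats first, then the fractional part each entry is left with.
whole = [int(s) for s in shares]
remainders = [float(s - w) for s, w in zip(shares, whole)]

print("quota:", float(quota))
print("whole seats:", whole, "=", sum(whole))
print("remainders:", [round(r, 3) for r in remainders])
```

The whole seats sum to 18, so two seats remain, and they go to the two entries with the largest remainders (385 and 745), which yields [6, 2, 3, 3, 6].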
For those who might be interested in the Bresenham line (non-)solution, here it is, adapted from existing Bresenham line code:
import itertools

def bresenhamLine(x0, y0, x1, y1):
    dx = abs(x1 - x0)
    dy = abs(y1 - y0)
    sx = x0 < x1 and 1 or -1
    sy = y0 < y1 and 1 or -1
    err = dx - dy
    points = []
    x, y = x0, y0
    while True:
        points += [(x, y)]
        if x == x1 and y == y1:
            break
        e2 = err * 2
        if e2 > -dy:
            err -= dy
            x += sx
        if e2 < dx:
            err += dx
            y += sy
    return points

def proportional(n, inp):
    cumsum = list(itertools.accumulate(inp))
    pts = bresenhamLine(0, 0, max(cumsum), n)
    yval = [y for x, y in pts]
    cumsum2 = [yval[x] for x in cumsum]
    res = [cumsum2[0]]
    for i, x in enumerate(cumsum2[1:]):
        res.append(x - cumsum2[i])
    return res
print (proportional(20,[832, 325, 415, 385, 745]))
print (proportional(20,[414, 918, 860, 978, 438]))
...but the output is
[6, 3, 3, 2, 6]
[2, 5, 5, 6, 2]
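...which is incorrect, because among the second through fourth items of the first list it assigns the "2" to the middle-ranked item (385) rather than to the lowest-ranked one (325). The Hagenbach-Bischoff quota method gets the allocation right.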
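The objection above amounts to a simple property: a larger input value should never receive fewer units than a smaller one. A minimal sketch of such a check follows; the helper name is_rank_consistent is hypothetical and is not part of either answer.

```python
def is_rank_consistent(data_set, allocation):
    """Return True if no larger input value gets fewer units than a smaller one."""
    # Sort (value, units) pairs by value, then compare neighbouring allocations.
    paired = sorted(zip(data_set, allocation))
    return all(a[1] <= b[1] for a, b in zip(paired, paired[1:]))


data = [832, 325, 415, 385, 745]
print(is_rank_consistent(data, [6, 3, 3, 2, 6]))  # Bresenham-based output
print(is_rank_consistent(data, [6, 2, 3, 3, 6]))  # Hagenbach-Bischoff output
```

The Bresenham-based allocation fails this check because 325 receives 3 units while 385 receives only 2. The Hagenbach-Bischoff allocation passes.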