I have a dataset with party results per precinct for several different elections. After reading this article, I would really like to use linear regression to answer this question: how did voters change their minds since the last election?
Unnamed: 0 Map Level Precinct ID Precinct Name Election Invalid Ballots (%) More Ballots Than Votes (#) More Votes Than Ballots (#) Total Voter Turnout (#) Total Voter Turnout (%) ... Average votes per minute (17:00-20:00) CDM ED FG GD LP NR UNM Results others
0 0 Precinct 1 63-1 2008 Parliamentary 0.0 0.0 0.0 749 62.11 ... 1.01 0.0 0.0 0.0 0.0 0.0 0.0 77.17 United National Movement 22.83
1 1 Precinct 10 63-10 2008 Parliamentary 0.0 0.0 0.0 419 70.42 ... 0.61 0.0 0.0 0.0 0.0 0.0 0.0 71.12 United National Movement 28.87
...
136 159 Precinct 8 63-1 2013 Presidential 1.75 0.0 0.0 506 50.75 ... 0.52 2.96 0.20 0.00 0.00 1.19 0.00 0.00 Giorgi Margvelashvili 95.65
137 160 Precinct 9 63-10 2013 Presidential 2.50 0.0 0.0 625 48.04 ... 0.66 1.92 0.80 0.00 0.00 1.60 0.00 0.00 Giorgi Margvelashvili 95.68
The place for a given district is given in the Precinct Name column.
To understand which voters changed their minds, you can build a very simple model. You can simplify an election in an N-party system by dropping all the parties you are not interested in (or that received fewer than one vote in both the first and the second election). Then you assume that everyone who voted the same way in the first election changes their mind in the same way in the second one. More specifically, everyone who voted for party Pᵢ in 2008 has the same probability of voting for party Pᵣ in 2013 (I call this probability Xᵢᵣ).
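Here is a rough pandas sketch of that simplification step, on a tiny made-up frame shaped like the data shown above (the real DataFrame has far more columns and precincts, and I approximate "fewer than one vote" by "zero percent everywhere"):

import pandas as pd

# tiny made-up frame in the shape of the data shown above (values taken from it)
df = pd.DataFrame({
    "Precinct Name": ["63-1", "63-1", "63-10", "63-10"],
    "Election": ["2008 Parliamentary", "2013 Presidential"] * 2,
    "CDM": [0.0, 2.96, 0.0, 1.92],
    "UNM": [77.17, 0.0, 71.12, 0.0],
    "others": [22.83, 95.65, 28.87, 95.68],
})

party_cols = ["CDM", "UNM", "others"]
elections = ["2008 Parliamentary", "2013 Presidential"]

# keep only the parties with a nonzero result in both elections
kept_parties = [
    p for p in party_cols
    if all((df.loc[df["Election"] == e, p] > 0).any() for e in elections)
]
df_small = df[["Precinct Name", "Election"] + kept_parties]
print(kept_parties)  # with these made-up rows only "others" survives in both elections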
So, for a given precinct, to "explain" or "predict" the number of votes Vᵣ²⁰¹³ of party Pᵣ in 2013 from the 2008 results, I can use the probabilities Xᵢᵣ as follows:

Vᵣ²⁰¹³ = Σᵢ Xᵢᵣ · Vᵢ²⁰⁰⁸

where Vᵢ²⁰⁰⁸ is the number of votes party Pᵢ got in 2008.
This is just a simple linear regression. So, since we have 7 parties, for each party Pᵣ the result should be an array of 7 values Xᵢᵣ. But with my linear regression model I can see that this is not the case.
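To make the shape I am expecting concrete, here is a rough numpy sketch with made-up percentages for three precincts and three parties (not my real data):

import numpy as np

# rows = precincts, columns = parties; vote shares in 2008
V_2008 = np.array([[77.17, 22.83,  0.00],
                   [71.12, 28.87,  0.00],
                   [55.00, 30.00, 15.00]])

# 2013 share of one party Pr, one value per precinct
v_r_2013 = np.array([0.35, 0.80, 5.00])

# least-squares estimate of the transition probabilities X_ir for this party
x_r, *_ = np.linalg.lstsq(V_2008, v_r_2013, rcond=None)
print(x_r.shape)  # (3,): one coefficient per party, which is the size I expect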
So I tried to implement the model in Python (sorry about that):
import random

def vector_subtract(v, w):
    """subtracts corresponding elements: v - w"""
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c, v):
    """multiplies every element of v by the scalar c"""
    return [c * v_i for v_i in v]

def error(x_i, y_i, beta):
return y_i - predict(x_i, beta)
def squared_error(x_i, y_i, beta):
return error(x_i, y_i, beta)**2
def squared_error_gradient(x_i, y_i, beta):
"""the gradient (with respect to beta)
corresponding to the ith squared error term"""
return [-2 * x_ij * error(x_i,y_i, beta)
for x_ij in x_i]
def predict(x_i, beta):
# x_i.insert(0,1)
"""assumes that the first element of each x_i is 1"""
return dot(x_i, beta)
def dot(v, w):
"""v_1 * w_1 + ... + v_n * w_n"""
return sum(v_i * w_i for v_i, w_i in zip(v, w))
def in_random_order(data):
"""generator that returns the elements of data in random order"""
indexes = [i for i, _ in enumerate(data)] # create a list of indexes
random.shuffle(indexes) # shuffle them
for i in indexes: # return the data in that order
yield data[i]
def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
    data = list(zip(x, y))  # materialize: a zip() iterator can only be consumed once and cannot be indexed
theta = theta_0 # initial guess
alpha = alpha_0 # initial step size
min_theta, min_value = None, float("inf") # the minimum so far
iterations_with_no_improvement = 0
# if we ever go 100 iterations with no improvement, stop
while iterations_with_no_improvement < 100:
value = sum( target_fn(x_i, y_i, theta) for x_i, y_i in data )
if value < min_value:
# if we've found a new minimum, remember it
# and go back to the original step size
min_theta, min_value = theta, value
iterations_with_no_improvement = 0
alpha = alpha_0
else:
# otherwise we're not improving, so try shrinking the step size
iterations_with_no_improvement += 1
alpha *= 0.9
# and take a gradient step for each of the data points
for x_i, y_i in in_random_order(data):
gradient_i = gradient_fn(x_i, y_i, theta)
theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))
return min_theta
def estimate_beta(x,y):
beta_initial = [random.random() for x_i in x[0]]
return minimize_stochastic(squared_error,
squared_error_gradient,
x,y,
beta_initial,
0.001)
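As a quick sanity check that the code above can recover coefficients at all, a tiny made-up example with a known relation y = 2·x should give a coefficient close to 2:

random.seed(0)
x_check = [[1.0], [2.0], [3.0], [4.0]]  # one feature per observation
y_check = [2.0, 4.0, 6.0, 8.0]
print(estimate_beta(x_check, y_check))  # a single coefficient, close to [2.0]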
For example, suppose we had one election in 2008 and one in 2013:
x = [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [77.17], [22.83]]  # each inner list is the % of votes one party received in 2008
y = [[0.35], [0.35], [0.0], [0.0], [2.43], [0.0], [0.0], [96.87]]  # each entry is the % of votes one party received in 2013
random.seed(0)
probabilities = [estimate_beta(x, y_i) for y_i in y]
print(probabilities)
It returns:
[[0.8444218515250481], [0.7579544029403025], [0.420571580830845], [0.25891675029296335], [0.5112747213686085], [0.4049341374504143], [0.7837985890347726], [0.30331272607892745]]
I was expecting each of these arrays to contain as many values as there are parties (one probability Xᵢᵣ per party), not a single number.