Linear regression does not return the expected number of betas

Time: 2019-10-15 16:10:15

Tags: python python-3.x linear-regression

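I have a dataset of precincts and election results by party for several elections. After reading this article, I really wanted to use linear regression to answer the question: how have voters changed their minds since the last election?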

Unnamed: 0  Map Level   Precinct ID Precinct Name   Election    Invalid Ballots (%) More Ballots Than Votes (#) More Votes Than Ballots (#) Total Voter Turnout (#) Total Voter Turnout (%) ... Average votes per minute (17:00-20:00)  CDM ED  FG  GD  LP  NR  UNM Results others
0   0   Precinct    1   63-1    2008 Parliamentary  0.0 0.0 0.0 749 62.11   ... 1.01    0.0 0.0 0.0 0.0 0.0 0.0 77.17   United National Movement    22.83
1   1   Precinct    10  63-10   2008 Parliamentary  0.0 0.0 0.0 419 70.42   ... 0.61    0.0 0.0 0.0 0.0 0.0 0.0 71.12   United National Movement    28.87
...
136 159 Precinct    8   63-1    2013 Presidential   1.75    0.0 0.0 506 50.75   ... 0.52    2.96    0.20    0.00    0.00    1.19    0.00    0.00    Giorgi Margvelashvili   95.65
137 160 Precinct    9   63-10   2013 Presidential   2.50    0.0 0.0 625 48.04   ... 0.66    1.92    0.80    0.00    0.00    1.60    0.00    0.00    Giorgi Margvelashvili   95.68

Here, Precinct Name identifies the location of the given area.

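To understand which voters changed their minds, you can build a very simple model. You can reduce the election to an N-party system by dropping every party you are not interested in (or that received fewer than one vote in both the first and the second election). Then assume that everyone who voted the same way in 2008 changes their mind in the same way in 2013. More specifically, everyone who voted for party Pᵢ in 2008 has the same probability of voting for party Pᵣ in 2013. (I call this probability Xᵢᵣ.)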

So, for a given precinct, to "explain" or "predict" the number of votes Vᵣ²⁰¹³ for party Pᵣ in 2013 from the 2008 results, I can use the probabilities Xᵢᵣ as follows:

$$V_r^{2013} = \sum_i V_i^{2008}\times X_{ir} $$

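This is a simple linear regression. So, since we have 7 parties, the result for each party $r$ should be an array of 7 values $X_{ir}$. However, my linear regression model does not give me that.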
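As a concrete illustration of the formula, here is a minimal sketch for a single precinct. The three-party setup, the vote shares, and the transition matrix `X` are made-up values chosen only to show the arithmetic; they do not come from the dataset.

```python
# Hypothetical 2008 vote shares for one precinct (fractions that sum to 1).
votes_2008 = [0.6, 0.3, 0.1]

# Hypothetical transition probabilities: X[i][r] is the probability that a
# 2008 voter of party i votes for party r in 2013. Each row sums to 1.
X = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]

n_parties = len(votes_2008)

# V_r^2013 = sum_i V_i^2008 * X_ir, computed for every party r.
votes_2013 = [
    sum(votes_2008[i] * X[i][r] for i in range(n_parties))
    for r in range(n_parties)
]

print(votes_2013)
```

Under these assumptions, predicting one party's 2013 share uses one coefficient per 2008 party, which is why each fitted array would be expected to have N entries.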

So I tried to implement the model in Python; please bear with me:

import random

def error(x_i, y_i, beta):
    """residual: observed y_i minus the model's prediction"""
    return y_i - predict(x_i, beta)

def squared_error(x_i, y_i, beta):
    return error(x_i, y_i, beta)**2

def squared_error_gradient(x_i, y_i, beta):
    """the gradient (with respect to beta)
    corresponding to the ith squared error term"""
    return [-2 * x_ij * error(x_i,y_i, beta)
           for x_ij in x_i]

def predict(x_i, beta):
    # x_i.insert(0,1)
    """assumes that the first element of each x_i is 1"""
    return dot(x_i, beta)

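def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

# helpers called by minimize_stochastic but not shown in the original post;
# assumed here to be the usual element-wise versions
def vector_subtract(v, w):
    """[v_1 - w_1, ..., v_n - w_n]"""
    return [v_i - w_i for v_i, w_i in zip(v, w)]

def scalar_multiply(c, v):
    """[c * v_1, ..., c * v_n]"""
    return [c * v_i for v_i in v]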

def in_random_order(data):
    """generator that returns the elements of data in random order"""
    indexes = [i for i, _ in enumerate(data)] # create a list of indexes
    random.shuffle(indexes) # shuffle them
    for i in indexes: # return the data in that order
        yield data[i]

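def minimize_stochastic(target_fn, gradient_fn, x, y, theta_0, alpha_0=0.01):
    # materialize the pairs: in Python 3, zip() is a one-shot iterator, so it
    # cannot be indexed by in_random_order or re-read on every pass
    data = list(zip(x, y))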
    theta = theta_0 # initial guess
    alpha = alpha_0 # initial step size
    min_theta, min_value = None, float("inf") # the minimum so far
    iterations_with_no_improvement = 0

    # if we ever go 100 iterations with no improvement, stop
    while iterations_with_no_improvement < 100:
        value = sum( target_fn(x_i, y_i, theta) for x_i, y_i in data )
        if value < min_value:
            # if we've found a new minimum, remember it
            # and go back to the original step size
            min_theta, min_value = theta, value
            iterations_with_no_improvement = 0
            alpha = alpha_0
        else:
            # otherwise we're not improving, so try shrinking the step size
            iterations_with_no_improvement += 1
            alpha *= 0.9
        # and take a gradient step for each of the data points
        for x_i, y_i in in_random_order(data):
            gradient_i = gradient_fn(x_i, y_i, theta)
            theta = vector_subtract(theta, scalar_multiply(alpha, gradient_i))
    return min_theta

def estimate_beta(x, y):
    # one random starting coefficient per feature in a row of x
    beta_initial = [random.random() for _ in x[0]]
    return minimize_stochastic(squared_error,
                              squared_error_gradient,
                              x,y,
                              beta_initial,
                              0.001)
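Before looking at the data, it may be worth confirming that the gradient itself is right. The sketch below is a self-contained finite-difference check; it re-declares compact versions of `dot`, `squared_error`, and `squared_error_gradient` so it runs on its own, and `numerical_gradient` plus the sample values are illustrative additions, not part of the code above.

```python
def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def squared_error(x_i, y_i, beta):
    return (y_i - dot(x_i, beta)) ** 2

def squared_error_gradient(x_i, y_i, beta):
    # d/d(beta_j) of (y_i - x_i . beta)^2 = -2 * x_ij * (y_i - x_i . beta)
    return [-2 * x_ij * (y_i - dot(x_i, beta)) for x_ij in x_i]

def numerical_gradient(x_i, y_i, beta, h=1e-6):
    """Central-difference approximation of the same gradient (illustrative helper)."""
    grad = []
    for j in range(len(beta)):
        up = beta[:]
        down = beta[:]
        up[j] += h
        down[j] -= h
        diff = squared_error(x_i, y_i, up) - squared_error(x_i, y_i, down)
        grad.append(diff / (2 * h))
    return grad

# Arbitrary sample point, chosen only for the check.
x_i = [1.0, 0.5, -2.0]
y_i = 3.0
beta = [0.2, -0.4, 0.1]

analytic = squared_error_gradient(x_i, y_i, beta)
numeric = numerical_gradient(x_i, y_i, beta)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff)
```

A very small `max_diff` suggests the analytic gradient agrees with the numerical one. Note that the gradient has one entry per element of `x_i`, so the length of `beta` always follows the length of each row of `x`.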

For example, suppose we had one election in 2008 and one in 2013:

x = [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0], [77.17], [22.83]] # each inner list is the % of people who voted for a party in 2008
y = [[0.35], [0.35], [0.0], [0.0], [2.43], [0.0], [0.0], [96.87]] # each inner list is the % of people who voted for a party in 2013
random.seed(0)
probabilities = [estimate_beta(x, y_i) for y_i in y]
print(probabilities)

It returns:

[[0.8444218515250481], [0.7579544029403025], [0.420571580830845], [0.25891675029296335], [0.5112747213686085], [0.4049341374504143], [0.7837985890347726], [0.30331272607892745]]
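A quick check that may help narrow this down: the eight numbers above look like the first eight values produced by `random.random()` after `random.seed(0)`. If so, each returned beta would simply be the initial guess from `estimate_beta`, never changed by a gradient step. A minimal sketch of that check, using only the standard library (`reported` is just a label for the output copied from above):

```python
import random

# Output copied from the run above, flattened to one number per beta.
reported = [
    0.8444218515250481,
    0.7579544029403025,
    0.420571580830845,
    0.25891675029296335,
    0.5112747213686085,
    0.4049341374504143,
    0.7837985890347726,
    0.30331272607892745,
]

# Reproduce the random starting values: one draw per call to estimate_beta.
random.seed(0)
draws = [random.random() for _ in reported]

# True would mean every beta is still its random starting value.
matches = draws == reported
print(matches)
```

That would be consistent with how the data is paired: `zip(x, y_i)` stops at the shorter sequence, and each `y_i` has a single element, so only `x[0] = [0.0]` is ever used, and its gradient `-2 * x_ij * error` is zero.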

I expected each array to contain as many values as there are parties.
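One possible reading of the model is that each regression needs one row per precinct and one feature per 2008 party, so that beta has one entry per party. The sketch below assumes that layout. All data is made up (three parties, six precincts, shares as fractions), `X_true` is a hypothetical transition matrix used to generate the 2013 shares, and `fit_column` is an illustrative plain batch gradient descent without an intercept, not the `estimate_beta` from the post.

```python
# Hypothetical 2008 vote shares: one row per precinct, one column per party.
votes_2008 = [
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.4, 0.4, 0.2],
    [0.3, 0.1, 0.6],
    [0.5, 0.3, 0.2],
]

# Hypothetical transition probabilities X_true[i][r] (2008 party i -> 2013 party r).
X_true = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]

n_parties = len(X_true)

def predict_row(x_i, beta):
    """x_i . beta, with no intercept term (an assumption of this sketch)."""
    return sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))

# Synthetic 2013 shares generated from V_r^2013 = sum_i V_i^2008 * X_ir.
votes_2013 = [
    [sum(row[i] * X_true[i][r] for i in range(n_parties)) for r in range(n_parties)]
    for row in votes_2008
]

def fit_column(rows, targets, steps=5000, lr=0.5):
    """Least squares by batch gradient descent; returns one beta per feature."""
    n = len(rows[0])
    m = len(rows)
    beta = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for x_i, y_i in zip(rows, targets):
            err = predict_row(x_i, beta) - y_i
            for j in range(n):
                grad[j] += 2 * err * x_i[j] / m
        beta = [b - lr * g for b, g in zip(beta, grad)]
    return beta

# One regression per 2013 party r; each returns n_parties coefficients X_ir.
columns = [
    fit_column(votes_2008, [row[r] for row in votes_2013])
    for r in range(n_parties)
]

for r, beta_r in enumerate(columns):
    print(r, [round(b, 3) for b in beta_r])
```

With rows of length N, every fitted beta has N entries. In the example from the post, each row of `x` has length 1, which by the same logic yields arrays of length 1.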

0 answers:

No answers