Why replace missing values with -9999?

Time: 2020-08-13 06:19:55

Tags: python-3.x linear-regression missing-data

Code for a linear regression model using numpy:

from statistics import mean
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.linear_model import LinearRegression

def create_dataset(hm, variance, step=2, correlation=False):
    # Build hm points: each y is val plus random noise in [-variance, variance),
    # and val drifts up ('pos') or down ('neg') by step after every point.
    val = 1
    ys = []
    for i in range(hm):
        y = val + random.randrange(-variance, variance)
        ys.append(y)
        if correlation == 'pos':
            val += step
        elif correlation == 'neg':
            val -= step
    xs = list(range(len(ys)))

    return np.array(xs, dtype=np.float64), np.array(ys, dtype=np.float64)

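def best_fit_slope_and_intercept(X, y):
    # Closed-form least-squares fit for a single feature:
    #   gradient  = (mean(x) * mean(y) - mean(x * y)) / (mean(x)^2 - mean(x^2))
    #   intercept = mean(y) - gradient * mean(x)
    x_mean = mean(X)
    y_mean = mean(y)
    gradient_calc1 = x_mean * y_mean - mean(X * y)
    gradient_calc2 = x_mean ** 2 - mean(X ** 2)
    gradient = gradient_calc1 / gradient_calc2
    intercept = y_mean - gradient * x_mean
    return gradient, intercept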


def r_squared(gradient, intercept, xs, ys):
    # Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
    the_mean = mean(ys)
    regression_y = [gradient * x + intercept for x in xs]
    total_error = ((ys - the_mean) ** 2).sum()
    residual_error = ((ys - regression_y) ** 2).sum()
    return 1 - residual_error / total_error

xs, ys = create_dataset(100, 1000, 10, 'pos')
m, b = best_fit_slope_and_intercept(xs, ys)
regression_line = [m * x + b for x in xs]

regressor = LinearRegression()
regressor.fit(xs.reshape(-1, 1), ys)

prediction = regressor.predict(xs.reshape(-1, 1))
print(r_squared(m, b, xs, ys))
plt.scatter(xs, ys)
plt.plot(xs, regression_line)
plt.plot(xs, prediction)
plt.show()
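As a sanity check on the code above, the mean-based formulas can be compared with a library least-squares fit. The sketch below is only an illustration: `closed_form_fit` is a hypothetical helper that repeats the same formulas with NumPy means, the data is made up, and `np.polyfit` with degree 1 stands in as the reference. scikit-learn's `LinearRegression` also fits ordinary least squares, so for a single feature its `coef_` and `intercept_` would be expected to agree as well.

```python
import numpy as np


def closed_form_fit(xs, ys):
    """Slope and intercept from the mean-based closed-form formulas (hypothetical helper)."""
    x_mean, y_mean = xs.mean(), ys.mean()
    slope = (x_mean * y_mean - (xs * ys).mean()) / (x_mean ** 2 - (xs ** 2).mean())
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Made-up noisy data around y = 3x + 5, seeded so the run is repeatable.
rng = np.random.default_rng(0)
xs = np.arange(50, dtype=np.float64)
ys = 3.0 * xs + 5.0 + rng.normal(0.0, 10.0, size=xs.size)

slope, intercept = closed_form_fit(xs, ys)

# np.polyfit with deg=1 returns the coefficients [slope, intercept]
# of the least-squares line, highest power first.
ref_slope, ref_intercept = np.polyfit(xs, ys, 1)

print(np.isclose(slope, ref_slope), np.isclose(intercept, ref_intercept))
```

If the two fits agree, the hand-written version and the library are solving the same least-squares problem; the library simply uses a more general solver that also handles multiple features.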
  1. Can anyone tell me why I should replace missing values with -9999, as sentdex does in his tutorial?

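On this point, sentdex says that most algorithms recognize such an input as an outlier. Is there a specific way this is handled? Does the math prevent outliers from negatively affecting the regression?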

  2. Is this the method that machine learning libraries use to implement linear regression, or is that a different concept?

Video on linear regression from scratch: https://www.youtube.com/watch?v=QUyAFokOmow&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v&index=11

1 answer:

Answer 0: (score: 0)

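  1. Replacing all missing values with 0 or -9999 removes the NaN values, which lets you normalize the dataset more efficiently (if you normalize it at all). Likewise, setting NaN values to -9999 helps your code treat those values as outliers.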
  2. Yes, linear regression is one method used in machine learning. There are many other techniques as well, such as decision trees!
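Whether a sentinel is really "treated as an outlier" depends on the algorithm. The sketch below is a minimal illustration on made-up data, with a hypothetical helper `fit_line` and the assumption that the missing values are in the feature column. It suggests that plain least squares does not ignore a -9999 sentinel: every row contributes its squared residual, so the sentinel rows pull the fitted line, whereas dropping the incomplete rows recovers the true slope. Models that split on thresholds, such as decision trees, can isolate an extreme value in its own branch, which is presumably the setting the tutorial's remark has in mind.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature (hypothetical helper)."""
    x_mean, y_mean = mean(xs), mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Made-up data lying exactly on y = 2x + 1; the feature is "missing" in two rows.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
missing = {3, 7}

# Option A: replace the missing feature values with the -9999 sentinel.
xs_sentinel = [-9999.0 if i in missing else x for i, x in enumerate(xs)]
slope_a, intercept_a = fit_line(xs_sentinel, ys)

# Option B: drop the rows whose feature value is missing.
kept = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in missing]
slope_b, intercept_b = fit_line([x for x, _ in kept], [y for _, y in kept])

print("sentinel:", slope_a, intercept_a)
print("dropped: ", slope_b, intercept_b)
```

With the sentinel in place, the two far-away rows act as high-leverage points and the slope collapses toward zero; with the rows dropped, the fit returns the slope of 2 and intercept of 1 that generated the data.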