Multiple regression lines on a single dataset

Time: 2019-04-21 10:54:12

Tags: regression

Example: https://i.stack.imgur.com/G1T4f.png (This image was found at random on Google.)

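I would like to know whether there is an existing regression algorithm that can fit multiple lines to the data, as in the picture, even when the data points are mixed together (unlabeled). I imagine it could be done by iteratively increasing the number of lines and clustering the points to the lines.

Thanks.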

1 answer:

Answer 0: (score: 3)

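What you are looking for is RANSAC, which is a good way to find multiple lines in noisy point data. Standard RANSAC selects the single best hypothesis (here, a line), but you can just as easily keep the best 2 or 4 lines, depending on your data.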
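To make the idea concrete, here is a minimal sketch of the RANSAC loop for a single line, written in plain Python without any library. The function name `ransac_line`, its parameters, and the toy data are illustrative assumptions, not part of skimage or sklearn: each trial draws two random points, treats the line through them as a hypothesis, and keeps the hypothesis with the largest set of points within `threshold` of the line.

```python
import random


def ransac_line(points, threshold=0.3, trials=200, seed=0):
    """Return ((slope, intercept), inliers) for the best line found.

    Illustrative helper, not a library API.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(trials):
        # Hypothesis: the line through two randomly chosen points.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical line, not representable as y = a * x + b
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        # Consensus set: points whose vertical residual is within the threshold.
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (slope, intercept), inliers
    return best_model, best_inliers


# Toy data (an assumption for illustration): two noise-free lines, unlabeled.
points = [(float(x), 2.0 * x + 1.0) for x in range(10)]
points += [(float(x), -1.0 * x + 20.0) for x in range(6)]
model, inliers = ransac_line(points)
print(model, len(inliers))
```

On this toy data the largest consensus set belongs to the 10-point line, so that is the hypothesis the loop keeps. Removing its inliers and running the function again on the remaining points would then pick up the second line.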

Here is an example from skimage (RANSAC is also available in sklearn):

import numpy as np
from matplotlib import pyplot as plt

from skimage.measure import LineModelND, ransac


np.random.seed(seed=1)

# generate coordinates of line
x = np.arange(-200, 200)
y = 0.2 * x + 20
data = np.column_stack([x, y])

# add gaussian noise to coordinates
noise = np.random.normal(size=data.shape)
data += 0.5 * noise
data[::2] += 5 * noise[::2]
data[::4] += 20 * noise[::4]

# add faulty data
faulty = np.array(30 * [(180., -100)])
faulty += 10 * np.random.normal(size=faulty.shape)
data[:faulty.shape[0]] = faulty

# fit line using all data
model = LineModelND()
model.estimate(data)

# robustly fit line only using inlier data with RANSAC algorithm
model_robust, inliers = ransac(data, LineModelND, min_samples=2,
                               residual_threshold=1, max_trials=1000)
outliers = ~inliers

# generate coordinates of estimated models
line_x = np.arange(-250, 250)
line_y = model.predict_y(line_x)
line_y_robust = model_robust.predict_y(line_x)

fig, ax = plt.subplots()
ax.plot(data[inliers, 0], data[inliers, 1], '.b', alpha=0.6,
        label='Inlier data')
ax.plot(data[outliers, 0], data[outliers, 1], '.r', alpha=0.6,
        label='Outlier data')
ax.plot(line_x, line_y, '-k', label='Line model from all data')
ax.plot(line_x, line_y_robust, '-b', label='Robust line model')
ax.legend(loc='lower left')
plt.show()
  

Here is a version adapted to your specific problem:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model

MIN_SAMPLES = 3

x = np.linspace(0, 2, 100)

xs, ys = [], []

# generate points for three lines described by a and b,
# we also add some noise:
for a, b in [(1.0, 2), (0.5, 1), (1.2, -1)]:
    xs.extend(x)
    ys.extend(a * x + b + .1 * np.random.randn(len(x)))

xs = np.array(xs)
ys = np.array(ys)
plt.plot(xs, ys, "r.")

colors = "rgbky"
idx = 0

while len(xs) > MIN_SAMPLES:

    # build design matrix for linear regressor
    X = np.ones((len(xs), 2))
    X[:, 1] = xs

    ransac = linear_model.RANSACRegressor(
        residual_threshold=.3, min_samples=MIN_SAMPLES
    )

    res = ransac.fit(X, ys)

    # vector of boolean values, describes which points belong
    # to the fitted line:
    inlier_mask = ransac.inlier_mask_

    # plot point cloud:
    xinlier = xs[inlier_mask]
    yinlier = ys[inlier_mask]

    # cycle through colors:
    color = colors[idx % len(colors)]
    idx += 1
    plt.plot(xinlier, yinlier, color + "*")

    # only keep the outliers:
    xs = xs[~inlier_mask]
    ys = ys[~inlier_mask]

plt.show()
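The loop above only plots the point groups and keeps going until almost no points remain, so it can also fit lines through leftover points. If you want the line parameters themselves, one possible variant is sketched below. The helper `fit_lines` and its `max_lines` and `min_inliers` parameters are assumptions added for illustration. It stops after a fixed number of lines or when a fit is supported by too few inliers, and it catches the `ValueError` that `RANSACRegressor` raises when it cannot find a valid consensus set.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor


def fit_lines(xs, ys, max_lines=3, min_inliers=20,
              residual_threshold=0.3, seed=0):
    """Peel off up to max_lines lines.

    Returns a list of (slope, intercept, n_inliers) tuples.
    Illustrative helper, not part of sklearn.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    lines = []
    while len(lines) < max_lines and len(xs) >= min_inliers:
        model = RANSACRegressor(residual_threshold=residual_threshold,
                                min_samples=2, random_state=seed)
        try:
            # The default estimator is LinearRegression, which fits the intercept.
            model.fit(xs.reshape(-1, 1), ys)
        except ValueError:
            break  # RANSAC could not find a valid consensus set
        mask = model.inlier_mask_
        if mask.sum() < min_inliers:
            break  # the remaining points no longer support a line
        lines.append((float(model.estimator_.coef_[0]),
                      float(model.estimator_.intercept_),
                      int(mask.sum())))
        # Only keep the outliers for the next round.
        xs, ys = xs[~mask], ys[~mask]
    return lines


# Same three lines as in the example above, with Gaussian noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 2, 100)
xs = np.concatenate([x, x, x])
ys = np.concatenate([a * x + b + 0.1 * rng.standard_normal(x.size)
                     for a, b in [(1.0, 2), (0.5, 1), (1.2, -1)]])
lines = fit_lines(xs, ys)
for slope, intercept, n in lines:
    print(f"y = {slope:.2f} * x + {intercept:.2f}  ({n} inliers)")
```

The values of `max_lines`, `min_inliers`, and `residual_threshold` depend on your data. The threshold in particular should be on the order of the noise level, otherwise points from neighbouring lines get absorbed into the same fit.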
