SQL Server中是否有任何线性回归函数?

时间:2010-03-29 09:31:13

标签: sql-server-2005 sql-server-2008 statistics linear-regression

SQL Server 2005/2008中是否有任何线性回归函数,类似于Linear Regression functions in Oracle

8 个答案:

答案 0 :(得分:40)

据我所知,没有。写一个很简单。下面给出了y = Alpha + Beta * x + epsilon的常数alpha和斜率beta:

-- test data (GroupIDs 1, 2 normal regressions, 3, 4 = no variance)
WITH some_table(GroupID, x, y) AS
(       SELECT 1,  1,  1    UNION SELECT 1,  2,  2    UNION SELECT 1,  3,  1.3  
  UNION SELECT 1,  4,  3.75 UNION SELECT 1,  5,  2.25 UNION SELECT 2, 95, 85    
  UNION SELECT 2, 85, 95    UNION SELECT 2, 80, 70    UNION SELECT 2, 70, 65    
  UNION SELECT 2, 60, 70    UNION SELECT 3,  1,  2    UNION SELECT 3,  1, 3
  UNION SELECT 4,  1,  2    UNION SELECT 4,  2,  2),
 -- linear regression query
/*WITH*/ mean_estimates AS
(   SELECT GroupID
          ,AVG(x * 1.)                                             AS xmean
          ,AVG(y * 1.)                                             AS ymean
    FROM some_table
    GROUP BY GroupID
),
stdev_estimates AS
(   SELECT pd.GroupID
          -- T-SQL STDEV() implementation is not numerically stable
          ,CASE      SUM(SQUARE(x - xmean)) WHEN 0 THEN 1 
           ELSE SQRT(SUM(SQUARE(x - xmean)) / (COUNT(*) - 1)) END AS xstdev
          ,     SQRT(SUM(SQUARE(y - ymean)) / (COUNT(*) - 1))     AS ystdev
    FROM some_table pd
    INNER JOIN mean_estimates  pm ON pm.GroupID = pd.GroupID
    GROUP BY pd.GroupID, pm.xmean, pm.ymean
),
standardized_data AS                   -- increases numerical stability
(   SELECT pd.GroupID
          ,(x - xmean) / xstdev                                    AS xstd
          ,CASE ystdev WHEN 0 THEN 0 ELSE (y - ymean) / ystdev END AS ystd
    FROM some_table pd
    INNER JOIN stdev_estimates ps ON ps.GroupID = pd.GroupID
    INNER JOIN mean_estimates  pm ON pm.GroupID = pd.GroupID
),
standardized_beta_estimates AS
(   SELECT GroupID
          ,CASE WHEN SUM(xstd * xstd) = 0 THEN 0
                ELSE SUM(xstd * ystd) / (COUNT(*) - 1) END         AS betastd
    FROM standardized_data pd
    GROUP BY GroupID
)
SELECT pb.GroupID
      ,ymean - xmean * betastd * ystdev / xstdev                   AS Alpha
      ,betastd * ystdev / xstdev                                   AS Beta
FROM standardized_beta_estimates pb
INNER JOIN stdev_estimates ps ON ps.GroupID = pb.GroupID
INNER JOIN mean_estimates  pm ON pm.GroupID = pb.GroupID

此处GroupID用于显示如何按源数据表中的某个值进行分组。如果您只想要表中所有数据的统计信息(不是特定的子组),则可以删除它和连接。为清楚起见,我使用了WITH语句。作为替代方案,您可以使用子查询。请注意表中使用的数据类型的精度,因为如果精度相对于您的数据不够高,数值稳定性会迅速恶化。

编辑:(在评论中回答彼得关于R2等其他统计数据的问题)

您可以使用相同的技术轻松计算其他统计信息。以下是具有R2,相关性和样本协方差的版本:

-- test data (GroupIDs 1, 2 normal regressions, 3, 4 = no variance)
WITH some_table(GroupID, x, y) AS
(       SELECT 1,  1,  1    UNION SELECT 1,  2,  2    UNION SELECT 1,  3,  1.3  
  UNION SELECT 1,  4,  3.75 UNION SELECT 1,  5,  2.25 UNION SELECT 2, 95, 85    
  UNION SELECT 2, 85, 95    UNION SELECT 2, 80, 70    UNION SELECT 2, 70, 65    
  UNION SELECT 2, 60, 70    UNION SELECT 3,  1,  2    UNION SELECT 3,  1, 3
  UNION SELECT 4,  1,  2    UNION SELECT 4,  2,  2),
 -- linear regression query
/*WITH*/ mean_estimates AS
(   SELECT GroupID
          ,AVG(x * 1.)                                             AS xmean
          ,AVG(y * 1.)                                             AS ymean
    FROM some_table pd
    GROUP BY GroupID
),
stdev_estimates AS
(   SELECT pd.GroupID
          -- T-SQL STDEV() implementation is not numerically stable
          ,CASE      SUM(SQUARE(x - xmean)) WHEN 0 THEN 1 
           ELSE SQRT(SUM(SQUARE(x - xmean)) / (COUNT(*) - 1)) END AS xstdev
          ,     SQRT(SUM(SQUARE(y - ymean)) / (COUNT(*) - 1))     AS ystdev
    FROM some_table pd
    INNER JOIN mean_estimates  pm ON pm.GroupID = pd.GroupID
    GROUP BY pd.GroupID, pm.xmean, pm.ymean
),
standardized_data AS                   -- increases numerical stability
(   SELECT pd.GroupID
          ,(x - xmean) / xstdev                                    AS xstd
          ,CASE ystdev WHEN 0 THEN 0 ELSE (y - ymean) / ystdev END AS ystd
    FROM some_table pd
    INNER JOIN stdev_estimates ps ON ps.GroupID = pd.GroupID
    INNER JOIN mean_estimates  pm ON pm.GroupID = pd.GroupID
),
standardized_beta_estimates AS
(   SELECT GroupID
          ,CASE WHEN SUM(xstd * xstd) = 0 THEN 0
                ELSE SUM(xstd * ystd) / (COUNT(*) - 1) END         AS betastd
    FROM standardized_data
    GROUP BY GroupID
)
SELECT pb.GroupID
      ,ymean - xmean * betastd * ystdev / xstdev                   AS Alpha
      ,betastd * ystdev / xstdev                                   AS Beta
      ,CASE ystdev WHEN 0 THEN 1 ELSE betastd * betastd END        AS R2
      ,betastd                                                     AS Correl
      ,betastd * xstdev * ystdev                                   AS Covar
FROM standardized_beta_estimates pb
INNER JOIN stdev_estimates ps ON ps.GroupID = pb.GroupID
INNER JOIN mean_estimates  pm ON pm.GroupID = pb.GroupID

EDIT 2 通过标准化数据(而不是仅居中)并因numerical stability issues替换STDEV来提高数值稳定性。对我来说,目前的实施似乎是稳定性和复杂性之间的最佳平衡。我可以通过用数值稳定的在线算法替换我的标准偏差来提高稳定性,但这会使实现变得非常复杂(并且减慢它)。类似地,使用例如Kahan(-Babuška-Neumaier)对SUMAVG的补偿似乎在有限的测试中表现得更好,但使查询更加复杂。只要我不知道T-SQL如何实现SUMAVG(例如它可能已经使用成对求和),我不能保证这些修改总能提高准确性。

答案 1 :(得分:14)

这是一种基于blog post on Linear Regression in T-SQL的替代方法,它使用以下等式:

enter image description here

博客中的SQL建议虽然使用了游标。这是我使用的forum answer的美化版本:

table
-----
X (numeric)
Y (numeric)

/**
 * m = (nSxy - SxSy) / (nSxx - SxSx)
 * b = Ay - (Ax * m)
 * N.B. S = Sum, A = Mean
 */
DECLARE @n INT
SELECT @n = COUNT(*) FROM table
SELECT (@n * SUM(X*Y) - SUM(X) * SUM(Y)) / (@n * SUM(X*X) - SUM(X) * SUM(X)) AS M,
       AVG(Y) - AVG(X) *
       (@n * SUM(X*Y) - SUM(X) * SUM(Y)) / (@n * SUM(X*X) - SUM(X) * SUM(X)) AS B
FROM table

答案 2 :(得分:3)

我实际上是使用Gram-Schmidt正交化编写了一个SQL例程。它以及其他机器学习和预测程序可在sqldatamine.blogspot.com

获得

根据Brad Larson的建议,我在这里添加了代码,而不仅仅是将用户引导到我的博客。这会产生与Excel中的linest函数相同的结果。我的主要资料来源是Hastie,Tibshirni和Friedman的“统计学习元素”(2008)。

--Create a table of data
create table #rawdata (id int,area float, rooms float, odd float,  price float)

insert into #rawdata select 1, 2201,3,1,400
insert into #rawdata select 2, 1600,3,0,330
insert into #rawdata select 3, 2400,3,1,369
insert into #rawdata select 4, 1416,2,1,232
insert into #rawdata select 5, 3000,4,0,540

--Insert the data into x & y vectors
select id xid, 0 xn,1 xv into #x from #rawdata
union all
select id, 1,rooms  from #rawdata
union all
select id, 2,area  from #rawdata
union all
select id, 3,odd  from #rawdata

select id yid, 0 yn, price yv  into #y from #rawdata

--create a residuals table and insert the intercept (1)
create table #z (zid int, zn int, zv float)
insert into #z select id , 0 zn,1 zv from #rawdata

--create a table for the orthoganal (#c) & regression(#b) parameters
create table #c(cxn int, czn int, cv float) 
create table #b(bn int, bv float) 


--@p is the number of independent variables including the intercept (@p = 0)
declare @p int
set @p = 1


--Loop through each independent variable and estimate the orthagonal parameter (#c)
-- then estimate the residuals and insert into the residuals table (#z)
while @p <= (select max(xn) from #x)
begin   
        insert into #c
    select  xn cxn,  zn czn, sum(xv*zv)/sum(zv*zv) cv 
        from #x join  #z on  xid = zid where zn = @p-1 and xn>zn group by xn, zn

    insert into #z
    select zid, xn,xv- sum(cv*zv) 
        from #x join #z on xid = zid   join  #c  on  czn = zn and cxn = xn  where xn = @p and zn<xn  group by zid, xn,xv

    set @p = @p +1
end

--Loop through each independent variable and estimate the regression parameter by regressing the orthoganal
-- resiuduals on the dependent variable y
while @p>=0 
begin

    insert into #b
    select zn, sum(yv*zv)/ sum(zv*zv) 
        from #z  join 
            (select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on  xn=bn group by yid, yv) y
        on zid = yid where zn = @p  group by zn

    set @p = @p-1
end

--The regression parameters
select * from #b

--Actual vs. fit with error
select yid, yv, fit, yv-fit err from #y join 
    (select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
     on yid = xid

--R Squared
select 1-sum(power(err,2))/sum(power(yv,2)) from 
(select yid, yv, fit, yv-fit err from #y join 
    (select xid, sum(xv*bv) fit from #x join #b on xn = bn  group by xid) f
     on yid = xid) d

答案 3 :(得分:2)

SQL Server中没有线性回归函数。但是要计算数据点x,y对之间的简单线性回归(Y&#39; = bX + A) - 包括相关系数的计算,确定系数(R ^ 2)和误差的标准估计(标准偏差) ),执行以下操作:

对于包含数字列regression_datax的表格y

declare @total_points int 
declare @intercept DECIMAL(38, 10)
declare @slope DECIMAL(38, 10)
declare @r_squared DECIMAL(38, 10)
declare @standard_estimate_error DECIMAL(38, 10)
declare @correlation_coefficient DECIMAL(38, 10)
declare @average_x  DECIMAL(38, 10)
declare @average_y  DECIMAL(38, 10)
declare @sumX DECIMAL(38, 10)
declare @sumY DECIMAL(38, 10)
declare @sumXX DECIMAL(38, 10)
declare @sumYY DECIMAL(38, 10)
declare @sumXY DECIMAL(38, 10)
declare @Sxx DECIMAL(38, 10)
declare @Syy DECIMAL(38, 10)
declare @Sxy DECIMAL(38, 10)

Select 
@total_points = count(*),
@average_x = avg(x),
@average_y = avg(y),
@sumX = sum(x),
@sumY = sum(y),
@sumXX = sum(x*x),
@sumYY = sum(y*y),
@sumXY = sum(x*y)
from regression_data

set @Sxx = @sumXX - (@sumX * @sumX) / @total_points
set @Syy = @sumYY - (@sumY * @sumY) / @total_points
set @Sxy = @sumXY - (@sumX * @sumY) / @total_points

set @correlation_coefficient = @Sxy / SQRT(@Sxx * @Syy) 
set @slope = (@total_points * @sumXY - @sumX * @sumY) / (@total_points * @sumXX - power(@sumX,2))
set @intercept = @average_y - (@total_points * @sumXY - @sumX * @sumY) / (@total_points * @sumXX - power(@sumX,2)) * @average_x
set @r_squared = (@intercept * @sumY + @slope * @sumXY - power(@sumY,2) / @total_points) / (@sumYY - power(@sumY,2) / @total_points)

-- calculate standard_estimate_error (standard deviation)
Select
@standard_estimate_error = sqrt(sum(power(y - (@slope * x + @intercept),2)) / @total_points)
From regression_data

答案 4 :(得分:1)

要添加到@ icc97答案,我已将加权版本包含在斜率和截距中。如果值都是常量,则斜率将为NULL(使用适当的设置SET ARITHABORT OFF; SET ANSI_WARNINGS OFF;),并且需要通过coalesce()替换为0.

以下是用SQL编写的解决方案:

with d as (select segment,w,x,y from somedatasource)
select segment,

avg(y) - avg(x) *
((count(*) * sum(x*y)) - (sum(x)*sum(y)))/
((count(*) * sum(x*x)) - (Sum(x)*Sum(x)))   as intercept,

((count(*) * sum(x*y)) - (sum(x)*sum(y)))/
((count(*) * sum(x*x)) - (sum(x)*sum(x))) AS slope,

avg(y) - ((avg(x*y) - avg(x)*avg(y))/var_samp(X)) * avg(x) as interceptUnstable,
(avg(x*y) - avg(x)*avg(y))/var_samp(X) as slopeUnstable,
(Avg(x * y) - Avg(x) * Avg(y)) / (stddev_pop(x) * stddev_pop(y)) as correlationUnstable,

(sum(y*w)/sum(w)) - (sum(w*x)/sum(w)) *
((sum(w)*sum(x*y*w)) - (sum(x*w)*sum(y*w)))/
  ((sum(w)*sum(x*x*w)) - (sum(x*w)*sum(x*w)))   as wIntercept,

((sum(w)*sum(x*y*w)) - (sum(x*w)*sum(y*w)))/
  ((sum(w)*sum(x*x*w)) - (sum(x*w)*sum(x*w))) as wSlope,

(count(*) * sum(x * y) - sum(x) * sum(y)) / (sqrt(count(*) * sum(x * x) - sum(x) * sum(x))
* sqrt(count(*) * sum(y * y) - sum(y) * sum(y))) as correlation,

count(*) as n

from d where x is not null and y is not null group by segment

其中w是重量。我对R进行了双重检查以确认结果。 可能需要将数据从somedatasource转换为浮点数。 我包含了不稳定的版本来警告你不要那些。 (特别感谢Stephan的另一个答案。)

请记住,相关性是数据点x和y的相关性,而不是预测的相关性。

答案 5 :(得分:0)

这是一个函数,它采用类型的表类型:table(Y float,X double) 调用XYDoubleType并假设我们的线性函数的形式为AX + B.它返回A和B一个表列,以防万一你想在连接中使用它或者

CREATE FUNCTION FN_GetABForData(
 @XYData as XYDoubleType READONLY
 ) RETURNS  @ABData TABLE(
            A  FLOAT,
            B FLOAT, 
            Rsquare FLOAT )
 AS
 BEGIN
    DECLARE @sx FLOAT, @sy FLOAT
    DECLARE @sxx FLOAT,@syy FLOAT, @sxy FLOAT,@sxsy FLOAT, @sxsx FLOAT, @sysy FLOAT
    DECLARE @n FLOAT, @A FLOAT, @B FLOAT, @Rsq FLOAT

    SELECT @sx =SUM(D.X) ,@sy =SUM(D.Y), @sxx=SUM(D.X*D.X),@syy=SUM(D.Y*D.Y),
        @sxy =SUM(D.X*D.Y),@n =COUNT(*)
    From @XYData D
    SET @sxsx =@sx*@sx
    SET @sxsy =@sx*@sy
    SET @sysy = @sy*@sy

    SET @A = (@n*@sxy -@sxsy)/(@n*@sxx -@sxsx)
    SET @B = @sy/@n  - @A*@sx/@n
    SET @Rsq = POWER((@n*@sxy -@sxsy),2)/((@n*@sxx-@sxsx)*(@n*@syy -@sysy))

    INSERT INTO @ABData (A,B,Rsquare) VALUES(@A,@B,@Rsq)

    RETURN 
 END

答案 6 :(得分:0)

我已经翻译了Excel中功能预测中使用的线性回归函数,并创建了一个返回a,b和预测的SQL函数。 您可以在excel帮助中查看完整的teorical解释,以获得FORECAST功能。 首先要做的就是创建表数据类型XYFloatType:

 CREATE TYPE [dbo].[XYFloatType] 
AS TABLE(
[X] FLOAT,
[Y] FLOAT)

然后编写以下函数:

    /*
-- =============================================
-- Author:      Me      :)
-- Create date: Today   :)
-- Description: (Copied Excel help): 
--Calculates, or predicts, a future value by using existing values. 
The predicted value is a y-value for a given x-value. 
The known values are existing x-values and y-values, and the new value is predicted by using linear regression. 
You can use this function to predict future sales, inventory requirements, or consumer trends.
-- =============================================
*/

CREATE FUNCTION dbo.FN_GetLinearRegressionForcast

(@PtXYData as XYFloatType READONLY ,@PnFuturePointint)
RETURNS @ABDData TABLE( a FLOAT, b FLOAT, Forecast FLOAT)
AS

BEGIN 
    DECLARE  @LnAvX Float
            ,@LnAvY Float
            ,@LnB Float
            ,@LnA Float
            ,@LnForeCast Float
    Select   @LnAvX = AVG([X])
            ,@LnAvY = AVG([Y])
    FROM @PtXYData;
    SELECT @LnB =  SUM ( ([X]-@LnAvX)*([Y]-@LnAvY) )  /  SUM (POWER([X]-@LnAvX,2))
    FROM @PtXYData;
    SET @LnA = @LnAvY - @LnB * @LnAvX;
    SET @LnForeCast = @LnA + @LnB * @PnFuturePoint;
    INSERT INTO @ABDData ([A],[B],[Forecast]) VALUES (@LnA,@LnB,@LnForeCast)
    RETURN 
END

/*
your tests: 

 (I used the same values that are in the excel help)
DECLARE @t XYFloatType 
INSERT @t VALUES(20,6),(28,7),(31,9),(38,15),(40,21)        -- x and y values
SELECT *, A+B*30 [Prueba]FROM dbo.FN_GetLinearRegressionForcast@t,30);
*/

答案 7 :(得分:0)

我希望以下答案有助于理解某些解决方案的来源。我将用一个简单的例子来说明它,但只要你知道如何使用索引表示法或矩阵,对许多变量的泛化在理论上是直截了当的。为了实现超过3个变量的解决方案,你可以使用Gram-Schmidt(参见上面的Colin Campbell的答案)或其他矩阵求逆算法。

由于我们需要的所有函数都是方差,协方差,平均值,总和等等都是SQL中的聚合函数,因此可以轻松实现解决方案。我在HIVE中已经完成了对Logistic模型得分的线性校准 - 在许多优点中,一个是你可以完全在HIVE中运行,而无需从一些脚本语言中退出并返回。

您的数据点由i索引的数据模型(x_1,x_2,y)是

  

y(x_1,x_2)= m_1 * x_1 + m_2 * x_2 + c

模型看起来是“线性的”,但不一定是,例如x_2可以是x_1的任何非线性函数,只要它没有自由参数,例如: x_2 = Sinh(3 *(x_1)^ 2 + 42)。即使x_2是“仅”x_2并且模型是线性的,回归问题也不是。只有当您确定问题是找到参数m_1,m_2,c以便它们最小化L2错误时,才会出现线性回归问题。

L2误差是sum_i((y [i] -f(x_1 [i],x_2 [i]))^ 2)。最小化这个w.r.t. 3个参数(设置偏导数w.r.t.每个参数= 0)产生3个未知数的3个线性方程。这些方程在参数中是LINEAR(这就是使其成为线性回归的方法)并且可以通过分析求解。对于简单模型(1个变量,线性模型,因此有两个参数)这样做是直截了当且具有指导性的。在误差向量空间上对非欧几里德度量范数的推广是直截了当的,对角特殊情况相当于使用“权重”。

以两个变量回到我们的模型:

  

y = m_1 * x_1 + m_2 * x_2 + c

取期望值=&gt;

  

= m_1 * + m_2 * + c(0)

现在采取协方差w.r.t. x_1和x_2,并使用cov(x,x)= var(x):

  

cov(y,x_1)= m_1 * var(x_1)+ m_2 * covar(x_2,x_1)(1)

     

cov(y,x_2)= m_1 * covar(x_1,x_2)+ m_2 * var(x_2)(2)

这是两个未知数的方程式,您可以通过反转2X2矩阵来求解。

以矩阵形式: ... 可以倒转产量 ... 其中

  

det = var(x_1)* var(x_2) - covar(x_1,x_2)^ 2

(哦,barf,到底是什么“声望点?如果你想看方程,请给我一些。”

在任何情况下,既然你有m1和m2的封闭形式,你可以解决(0)c。

我将上面的分析解决方案检查到Excel的求解器中,得到高斯噪声的二次方,剩余误差达到6位有效数字。

如果您想在SQL中进行大约20行的离散傅立叶变换,请与我联系。