Question

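I have a table in my SQL Server 2017 database that contains, in part, the following data:

My goal is to build a multivariate polynomial regression for each of the 19 columns, where LikingOrder is the dependent variable and each value in the 19 columns for a given RespID is an independent variable.

The final result should be the highest regression value of columns C1 through C19 for each RespID. The final result should look like this:

I have read about polym and tried to use it in the following script: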

ALTER PROCEDURE [dbo].[spRegressionPeak]   
@StudyID int
AS
BEGIN
Declare @sStudyID VARCHAR(50)
Set @sStudyID = CONVERT(VARCHAR(50),@StudyID)

--We use IsNull to pass zeroes where an average wasn't calculated, so
--that the polynomial regression can be calculated.
DECLARE @inquery  AS NVARCHAR(MAX) = '
    Select
c.StudyID, c.RespID, c.LikingOrder, avg(C1) as C1, avg(C2) as C2, avg(C3) as 
C3, avg(C4) as C4, avg(C5) as C5, avg(C6) as C6, avg(C7) as C7, avg(C8) as 
C8, avg(C9) as C9, avg(C10) as C10,
avg(C11) as C11, avg(C12) as C12, avg(C13) as C13, avg(C14) as C14, avg(C15) 
as C15, avg(C16) as C16, avg(C17) as C17, avg(isnull(C18,0)) as C18, avg(C19) 
as C19
from ClosedStudyResponses c
where c.StudyID = @StudyID
group by StudyID, RespID, LikingOrder
order by RespID'

--We are setting @inquery aka InputDataSet to be our initial dataset.
--R Services requires that a data.frame be passed to any calculations being
--generated.  As such, df is simply data framing the @inquery data.
--The res object holds the polynomial regression results by RespondentID and
--LikingOrder for each of the averages in the @inquery resultset.
EXEC sp_execute_external_script @language = N'R'
, @script = N'
    studymeans <- InputDataSet

    df <- data.frame(studymeans) 

    res1 <- lm(df$LikingOrder ~ polym(df$c1, df$c2, df$c3, df$c4, df$c5, df$c6, df$c7, df$c8, df$c9, 
    df$c10, df$c11, df$c12, df$c13, df$c14, df$c15, df$c16, df$c17, df$c18, df$c19, degree = 1, raw = TRUE)) 
    res <- data.frame(res1)

'
, @input_data_1 = @inquery
, @output_data_1_name = N'res'
, @params = N'@StudyID int'
,@StudyID = @StudyID 
--- Edit this line to handle the output data frame.
WITH RESULT SETS ((RespID int, res varchar(max)));
END;

When given a valid StudyID, the stored procedure above fails with the following error:

Error in model.frame.default(formula = df$LikingOrder ~ polym(df$c1, df$c2,  
: 
variable lengths differ (found for 'polym(df$c1, df$c2, df$c3, df$c4, df$c5, 
df$c6, df$c7, df$c8, df$c9, df$c10, df$c11, df$c12, df$c13, df$c14, df$c15, 
df$c16, df$c17, df$c18, df$c19, degree = 1, raw = TRUE)')
Calls: source ... lm -> eval -> eval -> <Anonymous> -> model.frame.default
In addition: There were 19 warnings (use warnings() to see them)

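Is this the correct use of polym? If not, how can I achieve my goal of computing 19 separate regressions? Finally, how can I programmatically determine the maximum of each regression?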

Answer 1

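Based on the question and the discussion in the comments, the assumptions made are:

RespID: a categorical parameter, not used in fitting the model
StudyID: ignored in the sample data
LinkingOrder: the dependent variable, i.e. the response (non-categorical); spelled LikingOrder in the question
C1 to C19: the independent variables, all numeric
Objective: fit a linear model on variables C1 to C19
Note: a polynomial fit is not added, because the final requested table in the question does not appear to list interaction terms.
Resource: Chapters 3 and 5 of ISLR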
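As a side note on the error quoted in the question, one likely cause is that R column names are case-sensitive: if the input query returns columns named C1 to C19, then df$c1 is NULL and polym() is handed arguments whose lengths do not match the response. The sketch below illustrates this on a small, made-up stand-in for InputDataSet (the data and the two-column layout are assumptions for illustration only). It also shows that an lm object cannot be passed to data.frame() directly, so a coefficient table is built instead.

```r
# Hypothetical stand-in for InputDataSet; values are random
set.seed(1)
df <- data.frame(LikingOrder = runif(30, 1, 9),
                 C1 = runif(30, 0, 9),
                 C2 = runif(30, 0, 9))

# Column names are case-sensitive: the lower-case name does not exist
is.null(df$c1)   # TRUE
is.null(df$C1)   # FALSE

# Use bare column names together with data = df instead of df$...
fit <- lm(LikingOrder ~ polym(C1, C2, degree = 2, raw = TRUE), data = df)

# An lm object is not a data frame; return its coefficients as one instead
res <- data.frame(term = names(coef(fit)),
                  estimate = unname(coef(fit)),
                  stringsAsFactors = FALSE)
print(res)
```

With two predictors and degree = 2, polym() generates the terms C1, C1^2, C2, C1*C2 and C2^2, so the table has six rows including the intercept. With 19 predictors the number of terms grows quickly, which is one more reason to fit the columns separately.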

Create a sample data frame

StudyID <- rep(10001, 100)
RespID <- c(rep(117,25), rep(119,25), rep(120,25), rep(121,25))
LinkingOrder <- floor(runif(100, 1, 9))
df <- data.frame(StudyID, RespID, LinkingOrder)
# Create columns C1 to C19
for (i in c(1:19)){
  vari <- paste("C", i, sep = "")
  df[vari] <-  floor(runif(100, 0, 9))
}

# Convert RespID to categorical variable
df$RespID <- as.factor(RespID)

Fit lm() and store the coefficients in a table

Note: the intercept term is included in the table.

# Fit lm() and store coefficients in a table
final_table <- data.frame()
for (respid in unique(df$RespID)){
  data <- df[df['RespID']==respid, ]
  data <- subset(data, select = -c(StudyID, RespID))

  lm.fit <- lm(LinkingOrder ~ ., data=data)

  # Save to table, keeping RespID so each row can be identified
  final_table <- rbind(final_table,
                       data.frame(RespID = respid, t(coef(lm.fit))))
}
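The question also asks how to find the maximum of each regression. One possible reading, which is an assumption rather than something stated in the question, is to fit a separate quadratic of LinkingOrder on each column for each RespID and report where the fitted parabola peaks. The sketch below does that on a small random data set with the same layout as the sample data frame above; quad_peak is a hypothetical helper name, and only three columns are used to keep the example short.

```r
# Illustrative data in the same layout as the sample data frame (random values)
set.seed(42)
df <- data.frame(RespID = factor(rep(c(117, 119), each = 25)),
                 LinkingOrder = floor(runif(50, 1, 9)))
for (i in 1:3) df[[paste0("C", i)]] <- floor(runif(50, 0, 9))

# Hypothetical helper: x-location of the maximum of a fitted quadratic
# y = b0 + b1*x + b2*x^2, or NA when the parabola has no maximum (b2 >= 0)
# or the quadratic term cannot be estimated
quad_peak <- function(x, y) {
  b <- coef(lm(y ~ x + I(x^2)))
  if (is.na(b[3]) || b[3] >= 0) return(NA_real_)
  unname(-b[2] / (2 * b[3]))
}

# One row per RespID, one peak location per column
cols <- paste0("C", 1:3)
peaks <- do.call(rbind, lapply(split(df, df$RespID), function(d) {
  data.frame(RespID = d$RespID[1],
             t(sapply(cols, function(cn) quad_peak(d[[cn]], d$LinkingOrder))))
}))
print(peaks)
```

The vertex -b1 / (2 * b2) can fall outside the observed range of a column, so in practice it may be worth comparing it against range(x) before reporting it as a peak.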
