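My SQL Server 2017 database has a table that contains, in part, the following data:
My goal is to create a multivariate polynomial regression for each of the 19 columns, where LikingOrder is my dependent variable and the values in each of the 19 columns for a given RespID are the independent variables.
The end result should be the highest regression value across columns C1 through C19 for each RespID. The final result should look like this:
I have read about polym and tried to use it in the following script: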
ALTER PROCEDURE [dbo].[spRegressionPeak]
@StudyID int
AS
BEGIN
Declare @sStudyID VARCHAR(50)
Set @sStudyID = CONVERT(VARCHAR(50),@StudyID)
--We use IsNull values to pass zeroes where an average wasn't calculated so
--that the polynomial regression can be calculated.
DECLARE @inquery AS NVARCHAR(MAX) = '
Select
c.StudyID, c.RespID, c.LikingOrder, avg(C1) as C1, avg(C2) as C2, avg(C3) as
C3, avg(C4) as C4, avg(C5) as C5, avg(C6) as C6, avg(C7) as C7, avg(C8) as
C8, avg(C9) as C9, avg(C10) as C10,
avg(C11) as C11, avg(C12) as C12, avg(C13) as C13, avg(C14) as C14, avg(C15)
as C15, avg(C16) as C16, avg(C17) as C17, avg(isnull(C18,0)) as C18, avg(C19)
as C19
from ClosedStudyResponses c
where c.StudyID = @StudyID
group by StudyID, RespID, LikingOrder
order by RespID
'
--We are setting @inquery aka InputDataSet to be our initial dataset.
--R Services requires that a data.frame be passed to any calculations being
--generated. As such, df is simply data framing the @inquery data.
--The res object holds the polynomial regression results by RespondentID and
--LikingOrder for each of the averages in the @inquery resultset.
EXEC sp_execute_external_script @language = N'R'
, @script = N'
studymeans <- InputDataSet
df <- data.frame(studymeans)
res1 <- lm(df$LikingOrder ~ polym(df$c1, df$c2, df$c3, df$c4, df$c5, df$c6, df$c7, df$c8, df$c9,
df$c10, df$c11, df$c12, df$c13, df$c14, df$c15, df$c16, df$c17, df$c18, df$c19, degree = 1, raw = TRUE))
res <- data.frame(res1)
'
, @input_data_1 = @inquery
, @output_data_1_name = N'res'
, @params = N'@StudyID int'
,@StudyID = @StudyID
--- Edit this line to handle the output data frame.
WITH RESULT SETS ((RespID int, res varchar(max)));
END;
When a valid StudyID is supplied, the stored procedure above fails with the following error:
Error in model.frame.default(formula = df$LikingOrder ~ polym(df$c1, df$c2, :
variable lengths differ (found for 'polym(df$c1, df$c2, df$c3, df$c4, df$c5, df$c6, df$c7, df$c8, df$c9, df$c10, df$c11, df$c12, df$c13, df$c14, df$c15, df$c16, df$c17, df$c18, df$c19, degree = 1, raw = TRUE)')
Calls: source ... lm -> eval -> eval -> <Anonymous> -> model.frame.default
In addition: There were 19 warnings (use warnings() to see them)
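Is this the correct use of polym? If not, how can I achieve my goal of computing 19 separate regressions? Finally, how can I programmatically determine the maximum value of each regression?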
Answer 0 (score: 1)
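Based on the question and the discussion in the comments, the following assumptions were made:
RespID: a categorical parameter, not used for model fitting
StudyID: ignored in the sample data
LinkingOrder: the dependent variable, i.e. the response (non-categorical); spelled LikingOrder in the question
C1 to C19: the independent variables, all numeric
Objective: determine a linear fit for the variables C1 to C19
Note: a polynomial fit was not added, because the final table requested in the question does not appear to list interaction terms.
Resource: Chapters 3 and 5 of ISLR
Create a sample data frame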
# Seed added so the random sample is reproducible (any value works)
set.seed(1)
StudyID <- rep(10001, 100)
RespID <- c(rep(117,25), rep(119,25), rep(120,25), rep(121,25))
LinkingOrder <- floor(runif(100, 1, 9))
df <- data.frame(StudyID, RespID, LinkingOrder)
# Create columns C1 to C19
for (i in 1:19){
  vari <- paste0("C", i)
  df[[vari]] <- floor(runif(100, 0, 9))
}
# Convert RespID to categorical variable
df$RespID <- as.factor(df$RespID)
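As an aside on the error reported in the question: one likely cause (an assumption, since the original table is not shown) is that R is case-sensitive. The query returns columns named C1 to C19, so df$c1 evaluates to NULL and the predictors no longer match the length of the response. The sketch below uses a hypothetical three-column toy data frame, not the real study data, to show the difference and to pass the columns to polym by name through the data argument.

```r
# Hypothetical toy data; only C1..C3 are used to keep the sketch short.
set.seed(1)
toy <- data.frame(
  LikingOrder = sample(1:9, 30, replace = TRUE),
  C1 = runif(30, 0, 9),
  C2 = runif(30, 0, 9),
  C3 = runif(30, 0, 9)
)

# R is case-sensitive: there is no column named "c1", so this is NULL.
missing_col <- is.null(toy$c1)
print(missing_col)

# Naming the real columns and supplying `data =` keeps every variable
# the same length as the response.
fit <- lm(LikingOrder ~ polym(C1, C2, C3, degree = 1, raw = TRUE), data = toy)

# Intercept plus one linear term per predictor.
print(length(coef(fit)))
```

With degree = 1 this is an ordinary multiple linear regression; a higher degree would add squared and cross-product terms.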
Fit lm() and store the coefficients in a table
Note: the intercept term is also included in the table
# Fit lm() per respondent and store the coefficients in a table
final_table <- data.frame()
for (respid in unique(df$RespID)){
  # Keep this respondent's rows and drop the identifier columns
  data <- df[df$RespID == respid, ]
  data <- subset(data, select = -c(StudyID, RespID))
  lm.fit <- lm(LinkingOrder ~ ., data = data)
  # Append one row: RespID, then the intercept and the C1..C19 coefficients
  final_table <- rbind(final_table, data.frame(RespID = respid, t(coef(lm.fit))))
}
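The question also asks how to find the maximum for each respondent programmatically. One possible reading, and it is only an assumption about what "highest regression value" means, is the column with the largest fitted coefficient per RespID. A minimal sketch of that idea follows, using a hypothetical sample with two respondents and three predictors instead of 19; the names slopes and peak are illustrative.

```r
set.seed(42)
# Hypothetical sample: 2 respondents, 3 predictors (the question has 19).
df <- data.frame(
  RespID = factor(rep(c(117, 119), each = 25)),
  LinkingOrder = floor(runif(50, 1, 9)),
  C1 = floor(runif(50, 0, 9)),
  C2 = floor(runif(50, 0, 9)),
  C3 = floor(runif(50, 0, 9))
)

# One linear fit per respondent; keep only the slope coefficients.
slopes <- t(sapply(split(df, df$RespID), function(d) {
  coef(lm(LinkingOrder ~ C1 + C2 + C3, data = d))[-1]  # drop the intercept
}))

# For each respondent, report the column with the largest coefficient.
peak <- data.frame(
  RespID = rownames(slopes),
  PeakColumn = colnames(slopes)[max.col(slopes, ties.method = "first")],
  PeakValue = apply(slopes, 1, max)
)
print(peak)
```

The same max.col and apply calls should work on the coefficient columns of final_table once the intercept column is excluded.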