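I have been trying to fit a GLM (Poisson with log link, specifically) on a dataset in SparkR. The dataset is quite large, so collecting it and using R's own glm() is not feasible. The model includes an exposure term, which has to enter as an offset (a regressor with a known coefficient, 1 in my case). Unfortunately, neither adding an offset term to the formula nor passing the column name (or the column itself, or a numeric vector obtained by selecting and collecting the column) works: in the first case the formula fails to parse, and in the other cases the offset is silently ignored, with no error message at all. Here is an example of what I have been trying (output in comments):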
library(datasets)
#set up Spark session
#Sys.setenv(SPARK_HOME = "/usr/share/spark_2.1.0")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
options(scipen = 15, digits = 5)
sparkR.session(spark.executor.instances = "20", spark.executor.memory = "6g")
# # Setting default log level to "WARN".
# # To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
# # 17/06/19 06:33:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# # 17/06/19 06:33:40 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
# # 17/06/19 06:34:22 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
message(sparkR.conf()$spark.app.id)
# # application_*************_****
#Test glm() in sparkR
data("iris")
iris_df = createDataFrame(iris)
# # Warning messages:
# # 1: In FUN(X[[i]], ...) :
# # Use Sepal_Length instead of Sepal.Length as column name
# # 2: In FUN(X[[i]], ...) :
# # Use Sepal_Width instead of Sepal.Width as column name
# # 3: In FUN(X[[i]], ...) :
# # Use Petal_Length instead of Petal.Length as column name
# # 4: In FUN(X[[i]], ...) :
# # Use Petal_Width instead of Petal.Width as column name
model = glm(Sepal_Length ~ offset(Sepal_Width) + Petal_Length, data = iris_df)
# # 17/06/19 08:46:47 ERROR RBackendHandler: fit on org.apache.spark.ml.r.GeneralizedLinearRegressionWrapper failed
# # java.lang.reflect.InvocationTargetException
# # ......
# # Caused by: java.lang.IllegalArgumentException: Could not parse formula: Sepal_Length ~ offset(Sepal_Width) + Petal_Length
# # at org.apache.spark.ml.feature.RFormulaParser$.parse(RFormulaParser.scala:200)
# # ......
model = glm(Sepal_Length ~ Petal_Length + offset(Sepal_Width), data = iris_df)
# # (Same error as above)
# The one below runs.
model = glm(Sepal_Length ~ Petal_Length, offset = Sepal_Width, data = iris_df, family = gaussian())
# # 17/06/19 08:51:21 WARN WeightedLeastSquares: regParam is zero, which might cause numerical instability and overfitting.
# # 17/06/19 08:51:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
# # 17/06/19 08:51:24 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
# # 17/06/19 08:51:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK
# # 17/06/19 08:51:24 WARN LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK
summary(model)
# # Deviance Residuals:
# # (Note: These are approximate quantiles with relative error <= 0.01)
# # Min 1Q Median 3Q Max
# # -1.24675 -0.30140 -0.01999 0.26700 1.00269
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 4.3066 0.078389 54.939 0
# # Petal_Length 0.40892 0.018891 21.646 0
# #
# # (Dispersion parameter for gaussian family taken to be 0.1657097)
# #
# # Null deviance: 102.168 on 149 degrees of freedom
# # Residual deviance: 24.525 on 148 degrees of freedom
# # AIC: 160
# #
# # Number of Fisher Scoring iterations: 1
# (RESULTS ARE SAME AS GLM WITHOUT OFFSET)
# Results in R:
model = glm(Sepal.Length ~ Petal.Length, offset = Sepal.Width, data = iris, family = gaussian())
summary(model)
# # Call:
# # glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(),
# # data = iris, offset = Sepal.Width)
# #
# # Deviance Residuals:
# # Min 1Q Median 3Q Max
# # -0.93997 -0.27232 -0.02085 0.28576 0.88944
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 0.85173 0.07098 12.00 <2e-16 ***
# # Petal.Length 0.51471 0.01711 30.09 <2e-16 ***
# # ---
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# #
# # (Dispersion parameter for gaussian family taken to be 0.1358764)
# #
# # Null deviance: 143.12 on 149 degrees of freedom
# # Residual deviance: 20.11 on 148 degrees of freedom
# # AIC: 130.27
# #
# # Number of Fisher Scoring iterations: 2
#Results in R without offset. Matches SparkR output with and w/o offset.
model = glm(Sepal.Length ~ Petal.Length, data = iris, family = gaussian())
summary(model)
# # Call:
# # glm(formula = Sepal.Length ~ Petal.Length, family = gaussian(),
# # data = iris)
# #
# # Deviance Residuals:
# # Min 1Q Median 3Q Max
# # -1.24675 -0.29657 -0.01515 0.27676 1.00269
# #
# # Coefficients:
# # Estimate Std. Error t value Pr(>|t|)
# # (Intercept) 4.30660 0.07839 54.94 <2e-16 ***
# # Petal.Length 0.40892 0.01889 21.65 <2e-16 ***
# # ---
# # Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# #
# # (Dispersion parameter for gaussian family taken to be 0.1657097)
# #
# # Null deviance: 102.168 on 149 degrees of freedom
# # Residual deviance: 24.525 on 148 degrees of freedom
# # AIC: 160.04
# #
# # Number of Fisher Scoring iterations: 2
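Note: the Spark version is 2.1.0 (as shown in the code). From what I checked, the implementation should be there. Also, the warning messages after the glm() call do not always appear, but that does not seem to affect what happens.
Am I doing something wrong, or is the offset term simply not used in Spark's GLM implementation? If it is the latter, is there a workaround that gives the same results as an offset term?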
Answer 0 (score: 1)
A Poisson GLM with response Y and offset log(K) is equivalent to a GLM with response Y/K and weights K.
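To see why the two fits coincide, it may help to compare the estimating equations. The notation below (y_i for the count, K_i for the exposure, x_i for the covariate vector, r_i for the rate) is introduced here for illustration and does not appear in the original post:

```latex
\begin{align*}
\text{Offset model:}\quad
  & \log \mu_i = \log K_i + x_i^\top \beta,
  && \sum_i (y_i - \mu_i)\, x_i = 0, \\
\text{Weighted rate model:}\quad
  & r_i = \frac{y_i}{K_i},\ \ \log \lambda_i = x_i^\top \beta,
  && \sum_i K_i\,(r_i - \lambda_i)\, x_i
     = \sum_i (y_i - K_i \lambda_i)\, x_i = 0 .
\end{align*}
```

Because K_i λ_i = K_i exp(x_i'β) = μ_i, both sets of score equations are the same, so they have the same solution for β. The weighted deviance also matches term by term, since r_i/λ_i = y_i/μ_i.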
An example using the Insurance dataset from MASS:
> glm(Claims ~ District + Group + Age, data=Insurance, family=poisson, offset=log(Holders))
Call: glm(formula = Claims ~ District + Group + Age, family = poisson,
data = Insurance, offset = log(Holders))
Coefficients:
(Intercept) District2 District3 District4 Group.L Group.Q Group.C Age.L Age.Q Age.C
-1.810508 0.025868 0.038524 0.234205 0.429708 0.004632 -0.029294 -0.394432 -0.000355 -0.016737
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual
Null Deviance: 236.3
Residual Deviance: 51.42 AIC: 388.7
> glm(Claims/Holders ~ District + Group + Age, data=Insurance, family=quasipoisson, weights=Holders)
Call: glm(formula = Claims/Holders ~ District + Group + Age, family = quasipoisson,
data = Insurance, weights = Holders)
Coefficients:
(Intercept) District2 District3 District4 Group.L Group.Q Group.C Age.L Age.Q Age.C
-1.810508 0.025868 0.038524 0.234205 0.429708 0.004632 -0.029294 -0.394432 -0.000355 -0.016737
Degrees of Freedom: 63 Total (i.e. Null); 54 Residual
Null Deviance: 236.3
Residual Deviance: 51.42 AIC: NA
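(The quasipoisson family is there only to stop R from complaining about the non-integer values it detects in the response.)
This technique should also be usable with Spark's GLM implementation.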
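A minimal, untested sketch of how this might look in SparkR 2.1.0 follows. It assumes that spark.glm() accepts a weightCol argument (as documented for that release) and that Spark's Poisson family tolerates a non-integer, non-negative response. The toy data and the column names claims, exposure, x and rate are hypothetical and chosen only for illustration:

```r
library(SparkR)
sparkR.session()

# Hypothetical toy data: counts, exposure and a single covariate.
local_df <- data.frame(
  claims   = c(2, 0, 5, 3, 1, 4),
  exposure = c(10, 5, 20, 15, 8, 12),
  x        = c(0.1, 0.4, 0.9, 0.5, 0.2, 0.7)
)
sdf <- createDataFrame(local_df)

# The response becomes the rate. The formula parser rejected offset(),
# so the ratio is built as a separate column rather than inside the formula.
sdf <- withColumn(sdf, "rate", sdf$claims / sdf$exposure)

# The exposure moves from the offset into the weights.
model <- spark.glm(sdf, rate ~ x,
                   family = poisson(link = "log"),
                   weightCol = "exposure")
summary(model)

# Base R reference fit with an explicit offset, for comparing coefficients.
ref <- stats::glm(claims ~ x, family = poisson,
                  offset = log(exposure), data = local_df)
coef(ref)
```

If the assumptions hold, the coefficients from summary(model) should agree with coef(ref) up to numerical tolerance. As with the quasipoisson fit in R above, the AIC reported for the reweighted model should not be expected to match the offset model.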