Question

During training, my data set contains an hour field with values from 0 to 23:

model = pipeline.fit(df)
prediction = model.transform(df)

So the sparse vector created by OneHotEncoder looks like this:

(24,[5],[1.0])

Now I want to score the model on a single-row data set df2 whose hour is set to 10:

model = pipeline.fit(df)
prediction = model.transform(df2)

Now the sparse vector obtained for the hour looks like this:

(11,[10],[1.0])
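
A plain-Python sketch of what seems to be happening with the two vectors above. This is not Spark code, only a model of the behaviour under two assumptions: the encoder has no category-count metadata on the hour column and therefore falls back to the largest value in the DataFrame it is transforming, and dropLast is set to False. The names infer_size, encode, and declared_size are hypothetical.

```python
def infer_size(hours, declared_size=None):
    """Return the one-hot vector length the encoder is assumed to use.

    If a size is declared up front (the "hint" asked about below), use it.
    Otherwise fall back to the largest value seen in the data plus one.
    """
    if declared_size is not None:
        return declared_size
    return int(max(hours)) + 1


def encode(hour, size):
    """Mimic the (size, indices, values) shape of a sparse one-hot vector."""
    return (size, [int(hour)], [1.0])


train_hours = list(range(24))  # training data covers 0..23
score_hours = [10]             # scoring data has a single row with hour 10

train_vec = encode(5, infer_size(train_hours))
score_vec = encode(10, infer_size(score_hours))
fixed_vec = encode(10, infer_size(score_hours, declared_size=24))

print(train_vec)  # (24, [5], [1.0])
print(score_vec)  # (11, [10], [1.0])
print(fixed_vec)  # (24, [10], [1.0])
```

If this model is right, the length depends on the data passed to transform rather than on what was seen in fit, which would explain why a downstream stage trained on 24 features fails on a vector of length 11.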

As a result, I get this error when trying to score the model with an hour value outside the trained range:

Caused by: java.lang.IndexOutOfBoundsException: 21 not in [0,11)

Here is the full error message.

Note, however, that I am calling pipeline.fit with a DataFrame that contains the whole range:

df.agg({"hour": "min"}).show()
df.agg({"hour": "max"}).show()



 +---------+ 
 |min(hour)| 
 +---------+ 
 |        0| 
 +---------+ 

 +---------+ 
 |max(hour)| 
 +---------+ 
 |       23| 
 +---------+

So is there a way to give OneHotEncoder a hint about the range of the encoded vector? Or is there a better approach?
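
One workaround that might sidestep the problem is to skip OneHotEncoder for this column and build the fixed-length vector directly in a UDF. The following is only a sketch under assumptions, not a confirmed fix: one_hot_index and make_hour_encoder are hypothetical helpers, the column names "hour" and "hour_vec" are placeholders, and the size of 24 is hard-coded from the known range of hours. The Spark imports sit inside the factory so that the validation helper can be tried without a Spark installation.

```python
def one_hot_index(hour, size=24):
    """Validate the hour and return the single active index of the vector."""
    idx = int(hour)
    if not 0 <= idx < size:
        raise ValueError("hour %r is outside [0, %d)" % (hour, size))
    return idx


def make_hour_encoder(size=24):
    """Build a PySpark UDF that always emits a sparse vector of length `size`."""
    # Imported lazily so one_hot_index stays usable without Spark.
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import udf

    def to_vector(hour):
        # Exactly one active position, regardless of which hours are present.
        return Vectors.sparse(size, [one_hot_index(hour, size)], [1.0])

    return udf(to_vector, VectorUDT())


# Hypothetical usage on the scoring DataFrame:
# encode_hour = make_hour_encoder(24)
# df2 = df2.withColumn("hour_vec", encode_hour(df2["hour"]))
```

The same UDF would have to be applied to the training DataFrame as well, so that both sides see vectors of length 24. A Python UDF is slower than the built-in encoder, so this trades some speed for a predictable vector size.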

Edit, Oct 9

I was told that this post contains a solution to my problem. Unfortunately, I get this error when trying the Python solution:

TypeError: alias() got an unexpected keyword argument 'metadata'

I am using Spark V2.1 (on IBM DataScience Experience, so I cannot upgrade to V2.2 and have to wait...)

Sparse vector created by OneHotEncoder is too short (Apache Spark, PySpark)
