java.lang.IllegalArgumentException when using pyspark.ml.feature.IDF with a SparseVector

Time: 2019-05-30 13:17:09

Tags: apache-spark pyspark jupyter-notebook

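I have a Spark DataFrame with two columns: one is paper_id and the other is a SparseVector. When I try to use the IDF() API provided by the Spark ML library, I get the following error: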

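java.lang.IllegalArgumentException: requirement failed: Column sparse_vector must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.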

I am running the code in Jupyter Notebook version 5.7.8 with Python version 3.7.3. I have added a dummy code snippet below.

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import SparseVector
from pyspark.ml.feature import IDF

sc = SparkContext()
sql = SQLContext(sc)

indices = [1,5,8]
values = [24, 72, 13]
sp_vec = SparseVector(10, indices, values)
df = sql.createDataFrame([("1", sp_vec)], ['paper_id', 'sparse_vector'])
df.show(truncate=False)

idf = IDF(inputCol='sparse_vector', outputCol='tf-idf')
model = idf.fit(df)

Here is the error trace:

Py4JJavaError: An error occurred while calling o243.fit.
: java.lang.IllegalArgumentException: requirement failed: Column sparse_vector must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:44)
    at org.apache.spark.ml.feature.IDFBase.validateAndTransformSchema(IDF.scala:59)
    at org.apache.spark.ml.feature.IDFBase.validateAndTransformSchema$(IDF.scala:58)
    at org.apache.spark.ml.feature.IDF.validateAndTransformSchema(IDF.scala:68)
    at org.apache.spark.ml.feature.IDF.transformSchema(IDF.scala:98)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
    at org.apache.spark.ml.feature.IDF.fit(IDF.scala:88)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

0 answers:

No answers