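I have a Spark DataFrame with two columns: one is paper_id and the other is a SparseVector. When I try to use the IDF() API provided by the Spark ML library, I get the following error:
java.lang.IllegalArgumentException: requirement failed: Column sparse_vector must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
I am running the code in Jupyter Notebook version 5.7.8 with Python version 3.7.3. I have added a dummy code snippet below.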
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg import SparseVector
from pyspark.ml.feature import IDF

sc = SparkContext()
sql = SQLContext(sc)

# Build one sparse vector of size 10 with three non-zero entries.
indices = [1, 5, 8]
values = [24, 72, 13]
sp_vec = SparseVector(10, indices, values)

# Single-row DataFrame: a paper id and its sparse vector.
df = sql.createDataFrame([("1", sp_vec)], ['paper_id', 'sparse_vector'])
df.show(truncate=False)

# Fitting the IDF model is the call that raises the IllegalArgumentException.
idf = IDF(inputCol='sparse_vector', outputCol='tf-idf')
model = idf.fit(df)
Here is the error trace:
Py4JJavaError: An error occurred while calling o243.fit.
: java.lang.IllegalArgumentException: requirement failed: Column sparse_vector must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:44)
at org.apache.spark.ml.feature.IDFBase.validateAndTransformSchema(IDF.scala:59)
at org.apache.spark.ml.feature.IDFBase.validateAndTransformSchema$(IDF.scala:58)
at org.apache.spark.ml.feature.IDF.validateAndTransformSchema(IDF.scala:68)
at org.apache.spark.ml.feature.IDF.transformSchema(IDF.scala:98)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.IDF.fit(IDF.scala:88)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)