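I have a sentiment analysis program that uses a recurrent neural network to predict whether a given movie review is positive or negative. I am using the Deeplearning4j deep learning library for the program. Now I need to add the program to an Apache Spark pipeline.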
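To do this, I have a class MovieReviewClassifier that extends org.apache.spark.ml.classification.ProbabilisticClassifier, and I have to add an instance of that class to the pipeline. The features needed to build the model are passed to the program with the setFeaturesCol(String s) method. The features I add are in String format, since they are a set of strings used for sentiment analysis. But the features are required to be of type org.apache.spark.mllib.linalg.VectorUDT. Is there a way to convert the strings to Vector UDT?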
I have attached my code for the pipeline implementation below:
public class RNNPipeline {
    final static String RESPONSE_VARIABLE = "s";
    final static String INDEXED_RESPONSE_VARIABLE = "indexedClass";
    final static String FEATURES = "features";
    final static String PREDICTION = "prediction";
    final static String PREDICTION_LABEL = "predictionLabel";

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("test-client").setMaster("local[2]");
        sparkConf.set("spark.driver.allowMultipleContexts", "true");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // ======================== Import data ====================================
        DataFrame dataFrame = sqlContext.read().format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load("/home/RNN3/WordVec/training.csv");

        // Split into train/test data
        double[] dataSplitWeights = {0.7, 0.3};
        DataFrame[] data = dataFrame.randomSplit(dataSplitWeights);

        // ======================== Preprocess ===========================
        // Encode labels
        StringIndexerModel labelIndexer = new StringIndexer().setInputCol(RESPONSE_VARIABLE)
                .setOutputCol(INDEXED_RESPONSE_VARIABLE)
                .fit(data[0]);

        // Convert indexed labels back to original labels (decode labels).
        IndexToString labelConverter = new IndexToString().setInputCol(PREDICTION)
                .setOutputCol(PREDICTION_LABEL)
                .setLabels(labelIndexer.labels());

        // ======================== Train ========================
        MovieReviewClassifier mrClassifier = new MovieReviewClassifier()
                .setLabelCol(INDEXED_RESPONSE_VARIABLE)
                .setFeaturesCol("Review");

        // Fit the pipeline for training.
        Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { labelIndexer, mrClassifier, labelConverter});
        PipelineModel pipelineModel = pipeline.fit(data[0]);
    }
}
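Review is the feature column; it contains the strings to be predicted as positive or negative.
I get the following error when I execute the code: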
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column Review must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually StringType.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:167)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:167)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:167)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:62)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:121)
at RNNPipeline.main(RNNPipeline.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Answer 0 (score: 2)
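VectorUDT is in fact a user-defined type for Vector that allows easy interaction with SQL via DataFrame.
DataFrame supports many basic and structured types; see the Spark SQL data types reference for the list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.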
And since what you are being asked for is an org.apache.spark.sql.types.UserDefinedType<Vector>, you can probably get away with passing either a DenseVector or a SparseVector created from your String.
The conversion from String ("Review"???) to Vector depends on how you have organized your data.
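One possible way to turn a review string into numbers that could back a SparseVector or DenseVector is a hashed bag-of-words. The following is only a minimal sketch in plain Java (Java 8 or later) with no Spark dependency; the class name HashedBagOfWords, its methods, and the bucket count of 16 are hypothetical and chosen for illustration. In Spark, the resulting indices and counts could then be handed to Vectors.sparse(size, indices, values) or Vectors.dense(values) from org.apache.spark.mllib.linalg.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of Spark or Deeplearning4j.
class HashedBagOfWords {

    // Maps each lower-cased token to a bucket in [0, numFeatures) and counts occurrences.
    static Map<Integer, Double> termFrequencies(String review, int numFeatures) {
        Map<Integer, Double> counts = new TreeMap<>();
        for (String token : review.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            int bucket = Math.floorMod(token.hashCode(), numFeatures);
            counts.merge(bucket, 1.0, Double::sum);
        }
        return counts;
    }

    // Expands the sparse counts into a dense array of length numFeatures.
    static double[] toDense(Map<Integer, Double> sparse, int numFeatures) {
        double[] dense = new double[numFeatures];
        for (Map.Entry<Integer, Double> entry : sparse.entrySet()) {
            dense[entry.getKey()] = entry.getValue();
        }
        return dense;
    }

    public static void main(String[] args) {
        Map<Integer, Double> sparse = termFrequencies("Great movie, great cast!", 16);
        System.out.println("sparse = " + sparse);
        System.out.println("dense  = " + Arrays.toString(toDense(sparse, 16)));
    }
}
```

Different words can collide in the same bucket, so a larger bucket count trades memory for fewer collisions. Whether word counts are a good input for the classifier depends on the model; for a recurrent network, word embeddings are usually a better fit.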
Answer 1 (score: 1)
The way to convert the String type to the Vector UDT is to use word2vec. I had to add a word2vec object to the Spark pipeline to do the conversion.
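In Spark ML this would typically mean placing a Tokenizer stage and a Word2Vec stage (both in org.apache.spark.ml.feature) before the classifier, and pointing setFeaturesCol at the Word2Vec output column. The fitted Word2Vec model represents each document as an average of its word vectors. The sketch below illustrates only that averaging idea in plain Java, without Spark. The class ReviewAveraging and the tiny embedding table are hypothetical, and unlike a real model the vectors here are made up rather than learned. As a simplification, it averages over the words found in the table.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of document vectors built by averaging word vectors.
class ReviewAveraging {

    // Averages the vectors of all tokens found in the embedding table.
    // Unknown tokens are skipped; a review with no known token yields a zero vector.
    static double[] averageVector(String review, Map<String, double[]> embeddings, int size) {
        double[] sum = new double[size];
        int known = 0;
        for (String token : review.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            double[] vector = embeddings.get(token);
            if (vector == null) {
                continue;
            }
            for (int i = 0; i < size; i++) {
                sum[i] += vector[i];
            }
            known++;
        }
        if (known > 0) {
            for (int i = 0; i < size; i++) {
                sum[i] /= known;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up 2-dimensional embeddings; a trained model would learn these values.
        Map<String, double[]> embeddings = new HashMap<>();
        embeddings.put("great", new double[] {1.0, 0.0});
        embeddings.put("boring", new double[] {0.0, 1.0});

        double[] features = averageVector("Great plot, boring ending", embeddings, 2);
        System.out.println(Arrays.toString(features));
    }
}
```

The resulting fixed-length array is the kind of value a Vector column holds, which is why a word2vec stage satisfies the VectorUDT check that the String column failed.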