I want to run machine learning (k-means) on my Spark source. I have a table with 2 columns: review text and label (positive or negative). Everything seems to work until I run the prediction, where I get the following error:

SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 22, localhost): org.apache.spark.SparkException: Unseen label

Here is the code:
sc <- spark_connect(master = "local", version = "2.0.0")

colnames(dfAllReviews3Cols) <- c("ReviewText", "LabelReview")
#db_drop_table(sc, "dfallreviews3cols")
reviewsTbl <- copy_to(sc, dfAllReviews3Cols)

#List tables
src_tbls(sc)

#Select preview
reviews_preview <- dbGetQuery(sc, "SELECT * FROM dfallreviews3cols LIMIT 10")

##KMeans
partitions <- reviewsTbl %>%
  sdf_partition(training = 0.7, test = 0.3, seed = 999)

reviewsTbl_training <- partitions$training
reviewTbl_test <- partitions$test

kmeans_model <- reviewsTbl_training %>%
  ml_kmeans(ReviewText ~ .)

pred <- sdf_predict(reviewTbl_test, kmeans_model) %>% collect
This is the error I get:

pred <- sdf_predict(reviewTbl_test, kmeans_model) %>% collect
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 1 times, most recent failure: Lost task 0.0 in stage 22.0 (TID 22, localhost): org.apache.spark.SparkException: Unseen label: AC wasn t working in my room When the repair man came to fix it he couldn t and then told me that it s winter and people don t need the AC Room was uncomfortably hot Check out was a nightmare My cab driver was waiting to take me to the airport Twice reception told me I had money to be owed however this was untrue after they checked their records I had the same problem at check in Bell boy took over 20 min to bring my bags down from my room Wouldn t recommend this hotel .
at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:169)
at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:165)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1911)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:893)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.collect(RDD.scala:892)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2183)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2532)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2182)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collect$1.apply(Dataset.scala:2187)
at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2545)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2187)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2163)
at sparklyr.Utils$.collect(utils.scala:200)
at sparklyr.Utils.collect(utils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sparklyr.Invoke.invoke(invoke.scala:139)
at sparklyr.StreamHandler.handleMethodCall(stream.scala:123)
at sparklyr.StreamHandler.read(stream.scala:66)
at sparklyr.BackendHandler.channelRead0(handler.scala:51)
at sparklyr.BackendHandler.channelRead0(handler.scala:4)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Unseen label: AC wasn t working in my room When the repair man came to fix it he couldn t and then told me that it s winter and people don t need the AC Room was uncomfortably hot Check out was a nightmare My cab driver was waiting to take me to the airport Twice reception told me I had money to be owed however this was untrue after they checked their records I had the same problem at check in Bell boy took over 20 min to bring my bags down from my room Wouldn t recommend this hotel .
at org.apache
How can I fix this?

Thanks in advance!
Answer (score: 2):
That's really not the way to go. Based on the error message, ReviewText is clearly a piece of unstructured text.

If you pass it directly to ml_kmeans, it is treated as a categorical variable and run through StringIndexer (this is where the failure occurs; if you're interested in the details, see spark.ml StringIndexer throws 'Unseen label' on fit(), although in practice they hardly matter here). The result is then assembled into a feature vector of length equal to 1. As you can imagine, that does not make a very good model.
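To see what that implicit handling amounts to, here is a minimal sketch (reusing the tables from the question; since the indexer is fitted on the training split only, any review text that occurs solely in the test split is an unseen label at transform time):

indexer <- ft_string_indexer(sc, input_col = "ReviewText",
                             output_col = "ReviewIndex") %>%
  ml_fit(reviewsTbl_training)

# Transforming the test split now fails with "Unseen label" for any
# review text that never occurred in the training data:
ml_transform(indexer, reviewTbl_test) %>% collect()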
In general you should at least tokenize the text, remove stop words, and convert the tokens into a numeric representation such as TF-IDF (and even that may not, and probably won't, be enough to achieve good results in practice).
Spark ML provides a small set of basic text transformers, including but not limited to Tokenizer, StopWordsRemover and NGram, while more advanced tools are available from third-party libraries (most notably John Snow Labs' NLP); all of these can be used with the Pipeline API. I strongly recommend reading the official documentation of these tools before you proceed.
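If you want to see what the basic transformers emit before committing to a full pipeline, here is a quick sketch on a throwaway table (the two sample reviews are made up for illustration):

toy <- copy_to(sc,
               data.frame(ReviewText = c("Great stay would come back",
                                         "The room was cold"),
                          stringsAsFactors = FALSE),
               overwrite = TRUE)

# Tokenize, then drop common stop words; each step appends an array column
toy %>%
  ft_tokenizer(input_col = "ReviewText", output_col = "raw_tokens") %>%
  ft_stop_words_remover(input_col = "raw_tokens", output_col = "tokens") %>%
  collect()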
Going back to your problem, you can start with something like this:
pipeline <- ml_pipeline(
  # Tokenize the input
  ft_tokenizer(sc, input_col = "ReviewText", output_col = "raw_tokens"),
  # Remove stopwords - https://en.wikipedia.org/wiki/Stop_words
  ft_stop_words_remover(sc, input_col = "raw_tokens", output_col = "tokens"),
  # Apply TF-IDF - https://en.wikipedia.org/wiki/Tf-idf
  ft_hashing_tf(sc, input_col = "tokens", output_col = "tfs"),
  ft_idf(sc, input_col = "tfs", output_col = "features"),
  ml_kmeans(sc, features_col = "features", init_mode = "random")
)
model <- ml_fit(pipeline, reviewsTbl_training)
and adjust it to fit your particular case.
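Once the pipeline is fitted you can score the held-out split with it, which replaces the failing sdf_predict() call from the question (a sketch, assuming the reviewTbl_test table defined earlier):

# ml_transform() runs every fitted stage in order and ends with the
# k-means prediction column appended to the test data:
pred <- ml_transform(model, reviewTbl_test) %>%
  collect()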