Question

Consider this simple example that uses sparklyr:

    library(sparklyr)
    library(janeaustenr) # to get some text data
    library(stringr)
    library(dplyr)

    # `sc` is assumed to be an existing Spark connection (e.g. from spark_connect())
    mytext <- austen_books() %>%
      mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable

    mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

    # Source:   table<mytext_spark> [?? x 3]
    # Database: spark_connection
       text                                                                 book                label
       <chr>                                                                <chr>               <int>
     1 SENSE AND SENSIBILITY                                                Sense & Sensibility     0
     2 ""                                                                   Sense & Sensibility     0
     3 by Jane Austen                                                       Sense & Sensibility     0
     4 ""                                                                   Sense & Sensibility     0
     5 (1811)                                                               Sense & Sensibility     0
     6 ""                                                                   Sense & Sensibility     0
     7 ""                                                                   Sense & Sensibility     0
     8 ""                                                                   Sense & Sensibility     0
     9 ""                                                                   Sense & Sensibility     0
    10 CHAPTER 1                                                            Sense & Sensibility     0
    11 ""                                                                   Sense & Sensibility     0
    12 ""                                                                   Sense & Sensibility     0
    13 The family of Dashwood had long been settled in Sussex. Their estate Sense & Sensibility     0
    14 was large, and their residence was at Norland Park, in the centre of Sense & Sensibility     0
    15 their property, where, for many generations, they had lived in so    Sense & Sensibility     0
    16 respectable a manner as to engage the general good opinion of their  Sense & Sensibility     0

The data frame is fairly small (about 70k rows and 14k unique words).

Now, training a naive bayes model takes only a few seconds on my cluster. First, I define the pipeline:

    pipeline <- ml_pipeline(sc) %>%
      ft_regex_tokenizer(input_col = 'text',
                         output_col = 'mytoken',
                         pattern = "\\s+",
                         gaps = TRUE) %>%
      ft_count_vectorizer(input_col = 'mytoken', output_col = 'finaltoken') %>%
      ml_naive_bayes(label_col = "label",
                     features_col = "finaltoken",
                     prediction_col = "pcol",
                     probability_col = "prcol",
                     raw_prediction_col = "rpcol",
                     model_type = "multinomial",
                     smoothing = 0,
                     thresholds = c(1, 1))

Then I train the naive bayes model:

    > library(microbenchmark)
    > microbenchmark(model <- ml_fit(pipeline, mytext_spark), times = 3)
    Unit: seconds
                                        expr      min       lq     mean   median       uq      max neval
     model <- ml_fit(pipeline, mytext_spark) 6.718354 6.996424 7.647227 7.274494 8.111663 8.948832     3

Now the problem is that trying to run any tree-based model (random forest, boosted trees, etc.) on the same (actually tiny!) data set does not work:

    pipeline2 <- ml_pipeline(sc) %>%
      ft_regex_tokenizer(input_col = 'text',
                         output_col = 'mytoken',
                         pattern = "\\s+",
                         gaps = TRUE) %>%
      ft_count_vectorizer(input_col = 'mytoken', output_col = 'finaltoken') %>%
      ml_gbt_classifier(label_col = "label",
                        features_col = "finaltoken",
                        prediction_col = "pcol",
                        probability_col = "prcol",
                        raw_prediction_col = "rpcol",
                        max_memory_in_mb = 10240,
                        cache_node_ids = TRUE)

    model2 <- ml_fit(pipeline2, mytext_spark)
    # won't work :(

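    Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 69.0 failed 4 times, most recent failure: Lost task 0.3 in stage 69.0 (TID 1580, 1.1.1.1.1, executor 5): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

I think this is due to the sparseness of the matrix representation of the tokens, but is there anything that can be done here? Is this a sparklyr problem? A Spark problem? Is my code inefficient?

Thanks!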

Answer 1

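You are getting this error because you are hitting the well-known 2 GB limit in Spark: https://issues.apache.org/jira/browse/SPARK-6235

The solution is to repartition your data before feeding it to the algorithm.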
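To see how a data set this small could still cross that limit, a back-of-envelope estimate may help. The sketch below assumes (this is not stated in the thread) that the tree learner ends up holding one dense 4-byte value per feature per row, and it uses the rough 70k-row and 14k-word figures from the question. The function name is made up for illustration.

```r
# Rough size of a dense rows x features table, in bytes.
# ASSUMPTION: one 4-byte value per (row, feature) cell; real overhead differs.
dense_size_bytes <- function(n_rows, n_features, bytes_per_cell = 4) {
  as.numeric(n_rows) * n_features * bytes_per_cell
}

limit_bytes <- 2^31 - 1                      # Integer.MAX_VALUE
estimate    <- dense_size_bytes(70e3, 14e3)  # rough figures from the question

estimate / 1e9          # estimated size in GB
estimate > limit_bytes  # TRUE means a single partition could not hold it
```

Under these assumptions the estimate is well above the limit when all rows sit in one partition, which is consistent with the error in the question.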

There are actually two gotchas in this post:

Working with local data.
Tree-based models in Spark are memory hungry.

So let's review the code, which looks harmless:

    library(janeaustenr) # to get some text data
    library(stringr)

    mytext <- austen_books() %>%
      mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable

    mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)

So what does the last line do?

copy_to (which is not designed for big data sets) simply copies the local R data frame to a Spark DataFrame with a single partition.

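So you just need to repartition your data to make sure that, once the pipeline has prepared the data and before it is fed into gbt, each partition is smaller than 2 GB.

You can repartition your data like this:

    # 20 is an arbitrary number I chose to test and it seems to work well in this case,
    # you might want to reconsider that if you have a bigger dataset.
    mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE) %>%
      sdf_repartition(partitions = 20)

PS1: max_memory_in_mb is the amount of memory you give gbt to compute its statistics. It is not directly related to the amount of input data.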

PS2: If you didn't give your executors enough memory, you may run into java.lang.OutOfMemoryError: GC overhead limit exceeded
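If executor memory turns out to be the problem, it can be raised when the connection is created. This is a minimal sketch, assuming a YARN cluster; the master value and the 4G figure are placeholders and do not come from the thread.

```r
library(sparklyr)

config <- spark_config()
# ASSUMPTION: 4G is illustrative only; size this to your own cluster.
config$spark.executor.memory <- "4G"

# ASSUMPTION: "yarn-client" stands in for whatever master you actually use.
sc <- spark_connect(master = "yarn-client", config = config)
```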

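EDIT: What does repartitioning data mean?

Before talking about repartitioning, we can refer to the definition of a partition. I'll try to be brief.

A partition is a logical chunk of a large distributed data set.

Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.

Increasing the partition count will make each partition hold less data (or none at all!)

Source: excerpt from @JacekLaskowski's Mastering Apache Spark book.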

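But the existing partitioning isn't always right, as in this case, so repartitioning is needed (sdf_repartition in sparklyr).

sdf_repartition will scatter and shuffle your data across your nodes, i.e. sdf_repartition(20) will create 20 partitions of your data instead of the 1 you originally had in this case.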
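The answer describes 20 as an arbitrary choice. One way to reason about a lower bound is to divide an estimated data size by a per-partition budget. The helper below is hypothetical (it is not part of sparklyr), and both the size estimate (70k rows x 14k features x 4 bytes) and the 256 MB budget are assumptions chosen for illustration.

```r
# Hypothetical helper: smallest partition count that keeps each partition
# under a chosen budget, assuming data is spread evenly across partitions.
min_partitions <- function(total_bytes, budget_bytes) {
  stopifnot(total_bytes >= 0, budget_bytes > 0)
  max(1L, as.integer(ceiling(total_bytes / budget_bytes)))
}

total_bytes  <- 70e3 * 14e3 * 4  # ASSUMPTION: rough dense-size estimate
budget_bytes <- 256 * 1024^2     # ASSUMPTION: 256 MB per partition

n_parts <- min_partitions(total_bytes, budget_bytes)
n_parts

# With a live connection the result could then be passed on, e.g.:
# mytext_spark <- sdf_repartition(mytext_spark, partitions = n_parts)
```

With these numbers the helper suggests 15 partitions, which is in the same range as the 20 used above. Real data is rarely spread perfectly evenly, so some headroom is sensible.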

I hope this helps.

The whole code:

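    library(sparklyr)
    library(janeaustenr) # to get some text data
    library(stringr)
    library(dplyr)

    # Assembled from the snippets in this thread; `sc` is assumed to be an open Spark connection.
    mytext <- austen_books() %>%
      mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable

    # 20 is an arbitrary number; reconsider it for a bigger dataset.
    mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE) %>%
      sdf_repartition(partitions = 20)

    pipeline2 <- ml_pipeline(sc) %>%
      ft_regex_tokenizer(input_col = 'text',
                         output_col = 'mytoken',
                         pattern = "\\s+",
                         gaps = TRUE) %>%
      ft_count_vectorizer(input_col = 'mytoken', output_col = 'finaltoken') %>%
      ml_gbt_classifier(label_col = "label",
                        features_col = "finaltoken",
                        prediction_col = "pcol",
                        probability_col = "prcol",
                        raw_prediction_col = "rpcol",
                        max_memory_in_mb = 10240,
                        cache_node_ids = TRUE)

    model2 <- ml_fit(pipeline2, mytext_spark)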

Answer 2

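Can you provide the full error traceback?

My guess is that you are running out of memory. Random forests and gbt trees are ensemble models, so they require more memory and computing power than naive bayes.

Try repartitioning your data (the spark.sparkContext.defaultParallelism value is a good place to start) so that each worker gets a smaller and more evenly distributed chunk.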
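From sparklyr, the default parallelism can be read off the underlying SparkContext. This is a sketch, assuming `sc` is an open sparklyr connection; the wrapper name is made up here and is not a sparklyr function.

```r
# Hypothetical wrapper: read SparkContext.defaultParallelism through sparklyr.
# ASSUMPTION: `sc` is a live connection created with sparklyr::spark_connect().
default_parallelism <- function(sc) {
  sparklyr::invoke(sparklyr::spark_context(sc), "defaultParallelism")
}

# Usage (needs a running Spark connection):
# n <- default_parallelism(sc)
# mytext_spark <- sparklyr::sdf_repartition(mytext_spark, partitions = n)
```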

If that doesn't work, try reducing your max_memory_in_mb parameter to 256.

How to train a random forest with a sparse matrix in Spark?
