Consider the following example:

library(sparklyr)
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)
mytext <- austen_books() %>%
mutate(label = as.integer(str_detect(text, 'great'))) #create a fake label variable
mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)
# Source: table<mytext_spark> [?? x 3]
# Database: spark_connection
text book label
<chr> <chr> <int>
1 SENSE AND SENSIBILITY Sense & Sensibility 0
2 "" Sense & Sensibility 0
3 by Jane Austen Sense & Sensibility 0
4 "" Sense & Sensibility 0
5 (1811) Sense & Sensibility 0
6 "" Sense & Sensibility 0
7 "" Sense & Sensibility 0
8 "" Sense & Sensibility 0
9 "" Sense & Sensibility 0
10 CHAPTER 1 Sense & Sensibility 0
11 "" Sense & Sensibility 0
12 "" Sense & Sensibility 0
13 The family of Dashwood had long been settled in Sussex. Their estate Sense & Sensibility 0
14 was large, and their residence was at Norland Park, in the centre of Sense & Sensibility 0
15 their property, where, for many generations, they had lived in so Sense & Sensibility 0
16 respectable a manner as to engage the general good opinion of their Sense & Sensibility 0
:
The data frame is reasonably small (about 70k rows and 14k unique words).

Now, training a naive bayes model only takes a few seconds on my cluster. First, I define the pipeline:

pipeline <- ml_pipeline(sc) %>%
  ft_regex_tokenizer(input_col = 'text',
                     output_col = 'mytoken',
                     pattern = "\\s+",
                     gaps = TRUE) %>%
  ft_count_vectorizer(input_col = 'mytoken', output_col = 'finaltoken') %>%
  ml_naive_bayes(label_col = "label",
                 features_col = "finaltoken",
                 prediction_col = "pcol",
                 probability_col = "prcol",
                 raw_prediction_col = "rpcol",
                 model_type = "multinomial",
                 smoothing = 0,
                 thresholds = c(1, 1))
and then train the naive bayes model:

> library(microbenchmark)
> microbenchmark(model <- ml_fit(pipeline, mytext_spark), times = 3)
Unit: seconds
                                     expr      min       lq     mean   median       uq      max neval
 model <- ml_fit(pipeline, mytext_spark) 6.718354 6.996424 7.647227 7.274494 8.111663 8.948832     3
Now the issue is that trying to run any tree-based model (random forest, boosted trees, etc.) on the same (actually tiny!!) dataset won't work:

pipeline2 <- ml_pipeline(sc) %>%
  ft_regex_tokenizer(input_col = 'text',
                     output_col = 'mytoken',
                     pattern = "\\s+",
                     gaps = TRUE) %>%
  ft_count_vectorizer(input_col = 'mytoken', output_col = 'finaltoken') %>%
  ml_gbt_classifier(label_col = "label",
                    features_col = "finaltoken",
                    prediction_col = "pcol",
                    probability_col = "prcol",
                    raw_prediction_col = "rpcol",
                    max_memory_in_mb = 10240,
                    cache_node_ids = TRUE)
model2 <- ml_fit(pipeline2, mytext_spark)
# won't work :(

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 69.0 failed 4 times, most recent failure: Lost task 0.3 in stage 69.0 (TID 1580, 1.1.1.1.1, executor 5): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

I think this is due to the sparseness of the matrix representation of the tokens, but is there anything that can be done here? Is this a sparklyr problem? A spark problem? Is my code inefficient?
Thanks!
Answer 0 (score: 4)
You are getting this error because you are actually hitting the famous 2G limit that we have in Spark: https://issues.apache.org/jira/browse/SPARK-6235

The solution is to repartition your data before feeding it to the algorithm.
There are actually two gotchas in this post, so let's review your seemingly harmless code:
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)
mytext <- austen_books() %>%
mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable
mytext_spark <- copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE)
So what does the last line do? copy_to (not designed for big data sets) actually just copies your local R data frame into a one-partition Spark DataFrame.

So you'll just need to repartition your data to make sure that once the pipeline has prepared your data, right before it is fed to gbt, the partition size is less than 2GB.
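As a quick check (my addition, not from the original answer), you can confirm the single-partition claim, assuming your sparklyr version provides sdf_num_partitions():

# Expect 1 for a plain copy_to() of a local data frame
sdf_num_partitions(mytext_spark)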
So here is what you need to do to repartition your data:

# 20 is an arbitrary number I chose to test and it seems to work well in this case,
# you might want to reconsider that if you have a bigger dataset.
mytext_spark <-
  copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE) %>%
  sdf_repartition(partitions = 20)
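Alternatively (my addition, worth verifying against your sparklyr version), sdf_copy_to() exposes a repartition argument, so you can request the partitioning at copy time:

# Copy and split into 20 partitions in one step
mytext_spark <- sdf_copy_to(sc, mytext, name = 'mytext_spark',
                            repartition = 20, overwrite = TRUE)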
PS1: max_memory_in_mb is the amount of memory you are giving gbt to compute its statistics. It is not directly related to the amount of data given as input.

PS2: If you didn't set up enough memory for your executors, you might run into a java.lang.OutOfMemoryError: GC overhead limit exceeded.
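Regarding PS2, a common way to give the driver and executors more memory with sparklyr is to set it in the config before connecting. A minimal sketch, assuming a local master and that 4G is enough for your workload:

library(sparklyr)
config <- spark_config()
config$`sparklyr.shell.driver-memory`   <- "4G"  # driver JVM heap
config$`sparklyr.shell.executor-memory` <- "4G"  # executor JVM heap
sc <- spark_connect(master = "local", config = config)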
EDIT: What does repartitioning the data mean?
Before talking about repartitioning, it helps to recall what a partition is. I'll try to keep it short.

A partition is a logical chunk of a large distributed data set.

Spark manages data using partitions, which help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, to optimize transformation operations it creates partitions to hold the data chunks.

Increasing the partition count makes each partition hold less data (or possibly none at all!).

Source: excerpt from @JacekLaskowski's Mastering Apache Spark book.
But data partitioning isn't always right, like in this case. So repartitioning is needed (sdf_repartition in sparklyr).

sdf_repartition will scatter and shuffle your data across your nodes, i.e. sdf_repartition(20) will create 20 partitions of your data instead of the single partition you originally had in this case.

I hope this helps.

The whole code:
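(For reference, here is the full example with the repartition step applied; a reassembly of the snippets above, assuming the Spark connection sc from your session.)

library(sparklyr)
library(janeaustenr) # to get some text data
library(stringr)
library(dplyr)

mytext <- austen_books() %>%
  mutate(label = as.integer(str_detect(text, 'great'))) # create a fake label variable

# Repartition right after copying, so no single partition exceeds the 2G limit
mytext_spark <-
  copy_to(sc, mytext, name = 'mytext_spark', overwrite = TRUE) %>%
  sdf_repartition(partitions = 20)

pipeline2 <- ml_pipeline(sc) %>%
  ft_regex_tokenizer(input_col = 'text',
                     output_col = 'mytoken',
                     pattern = "\\s+",
                     gaps = TRUE) %>%
  ft_count_vectorizer(input_col = 'mytoken', output_col = 'finaltoken') %>%
  ml_gbt_classifier(label_col = "label",
                    features_col = "finaltoken",
                    prediction_col = "pcol",
                    probability_col = "prcol",
                    raw_prediction_col = "rpcol",
                    max_memory_in_mb = 10240,
                    cache_node_ids = TRUE)

model2 <- ml_fit(pipeline2, mytext_spark)  # should now fit without hitting the 2G limit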
Answer 1 (score: 0)
Could you provide the full error traceback?
My guess is that you are running out of memory. Random forests and gbt trees are ensemble models, so they require more memory and computational power than naive bayes.
Try repartitioning your data (spark.sparkContext.defaultParallelism is a good value to start with) so that each worker gets a smaller, more evenly distributed chunk.
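If you want to start from defaultParallelism (a sketch of mine, not part of the original answer), you can read it from the JVM side through sparklyr's invoke() and pass it to sdf_repartition():

# Number of partitions Spark would use by default for parallel operations
n_parts <- sc %>% spark_context() %>% invoke("defaultParallelism")
mytext_spark <- sdf_repartition(mytext_spark, partitions = n_parts)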
If that doesn't work, try reducing your max_memory_in_mb parameter to 256.
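Concretely, that just means lowering the parameter in the ml_gbt_classifier() call (a sketch reusing the question's pipeline; it assumes sc and mytext_spark from above):

pipeline2 <- ml_pipeline(sc) %>%
  ft_regex_tokenizer(input_col = 'text', output_col = 'mytoken',
                     pattern = "\\s+", gaps = TRUE) %>%
  ft_count_vectorizer(input_col = 'mytoken', output_col = 'finaltoken') %>%
  ml_gbt_classifier(label_col = "label",
                    features_col = "finaltoken",
                    max_memory_in_mb = 256,  # reduced from 10240
                    cache_node_ids = TRUE)

model2 <- ml_fit(pipeline2, mytext_spark)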