I am tokenizing a month's worth of URLs. I cannot get through a fit without running out of heap space. The cached dataframe appears to take up less than 400 MB of RAM, and I have 1.9 TB available in total across 11 nodes.
As you can see in the code below, I have thrown a lot of resources at it in the form of executor RAM and driver RAM, and I create plenty of partitions both on the dataframe and in the Word2Vec step itself. I am not sure how to give it more RAM or how to break the work into smaller pieces, and I feel like I must be missing something fundamental about this error.
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import Tokenizer, RegexTokenizer, Word2Vec

myconf = SparkConf().setAll([
("spark.driver.memory", "200G"),
("spark.driver.cores", "10"),
("spark.executor.cores", "10"),
("spark.executor.memory", "200G"),
("spark.driver.maxResultSize","100G"),
("spark.yarn.executor.memoryOverhead","20G"),
("spark.kryoserializer.buffer.max", "1400mb"),
("spark.maxRemoteBlockSizeFetchToMem","1400mb"),
("spark.executor.extraJavaOptions","-XX:+UseConcMarkSweepGC"),
("spark.sql.shuffle.partitions","3000"),
])
appName = "word2vec-test2"
spark = (SparkSession.builder.config(conf=myconf)
.getOrCreate())
interval_df = (spark.read.parquet("/output/ddrml/preprocessed/201906*/")
.select("request")
.withColumnRenamed("request", "uri")
.where(F.col("uri").isNotNull())
.dropDuplicates()
.withColumn("uniqueID",F.monotonically_increasing_id())
.cache())
tokenizer = Tokenizer(inputCol="uri", outputCol="words")
regexTokenizer = RegexTokenizer(inputCol="uri", outputCol="words", pattern="\\W")
regexTokenized = regexTokenizer.transform(interval_df)
regexTokenized = regexTokenized.repartition(2000)
word2Vec = Word2Vec(vectorSize=15, minCount=0, inputCol="words", outputCol="uri_vec", numPartitions=500, maxSentenceLength=10)
model = word2Vec.fit(regexTokenized)
print("Made it through fit")
It seems to have some kind of trouble keeping up. I think something is doing a bunch of garbage collecting, and then it dies with the Java heap error.
19/07/17 12:28:43 ERROR scheduler.LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
19/07/17 12:28:43 WARN scheduler.LiveListenerBus: Dropped 1 SparkListenerEvents since Wed Dec 31 19:00:00 EST 1969
Exception in thread "dispatcher-event-loop-4" java.lang.OutOfMemoryError: Java heap space
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "dispatcher-event-loop-4"
19/07/17 12:30:48 WARN scheduler.LiveListenerBus: Dropped 794 SparkListenerEvents since Wed Jul 17 12:28:43 EDT 2019
19/07/17 12:32:00 WARN server.TransportChannelHandler:
Exception in connection from /192.168.7.29:46960
java.lang.OutOfMemoryError: Java heap space
at sun.reflect.ByteVectorImpl.trim(ByteVectorImpl.java:70)
at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:386)
at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
at sun.reflect.ReflectionFactory.generateConstructor(ReflectionFactory.java:398)
at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:360)
at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1520)
at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:79)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:507)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:482)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:482)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:379)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:669)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1876)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1745)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2033)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:557)
at org.apache.spark.rpc.netty.NettyRpcEndpointRef.readObject(NettyRpcEnv.scala:495)
at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1158)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
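One thing I have not ruled out is the size of the vocabulary itself. With minCount=0 every distinct token from the URLs is kept, and my understanding (an assumption about the implementation, not something I have confirmed) is that the fitted Word2Vec model holds float arrays on the order of vocabSize * vectorSize on the driver and broadcasts them to the executors. A quick diagnostic sketch, reusing regexTokenized from above:

# Count the distinct tokens that minCount=0 would keep in the vocabulary.
vocab_size = (regexTokenized
              .select(F.explode("words").alias("token"))
              .distinct()
              .count())
vector_size = 15  # matches the Word2Vec vectorSize above
# Assumption: roughly two 4-byte float arrays of vocabSize * vectorSize.
approx_mb = vocab_size * vector_size * 4 * 2 / (1024.0 * 1024.0)
print("distinct tokens:", vocab_size)
print("rough in-memory model size: %.1f MB" % approx_mb)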
Update 1: I can get it to run on 20+ days of data by changing the conf as below. I still can't get it to run on the full 30+ days. Also, when I look at the job in the UI via the history server, I don't see any failed executors at the stage where the job dies. I'm not sure what to make of that.
myconf = SparkConf().setAll([
("spark.driver.memory", "32G"),
("spark.driver.cores", "10"),
("spark.executor.cores", "5"),
("spark.executor.memory", "15G"),
("spark.rpc.message.maxSize","800"),
("spark.default.parallelism","2000"),
])
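In case it is relevant, the way I would break the job into smaller pieces is simply to read an explicit range of day paths instead of the whole-month glob; a rough sketch (the 1-20 day range is just a placeholder):

# Sketch: read a subset of days rather than the 201906* glob above.
days = ["/output/ddrml/preprocessed/201906%02d/" % d for d in range(1, 21)]
interval_df = (spark.read.parquet(*days)
               .select("request")
               .withColumnRenamed("request", "uri")
               .where(F.col("uri").isNotNull())
               .dropDuplicates()
               .withColumn("uniqueID", F.monotonically_increasing_id())
               .cache())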