Spark+Python set GC memory threshold

Time: 2017-08-05 11:30:59

Tags: python apache-spark memory garbage-collection

I'm trying to run a Python worker (PySpark app) that uses too much memory, and my app is getting killed by YARN for exceeding memory limits (I'm trying to lower memory usage in order to be able to spawn more workers).

I come from Java/Scala, so in my head Python's GC works similarly to the JVM's...

Is there a way to tell Python how much "available memory" it has? I mean, Java runs a GC when the heap is almost full. I want Python to do the same, so YARN doesn't kill my application for using too much memory when that memory is garbage (I'm on Python 3.3 and there are memory references on my machine).

I've seen resource hard and soft limits, but no documentation says whether GC triggers on them or not. AFAIK nothing triggers GC based on memory usage. Does anyone know a way to do so?

Thanks,

1 answer:

Answer 0 (score: 1)

CPython (I assume this is the one you use) differs significantly from Java. Its main garbage collection mechanism is reference counting. Unless you deal with circular references (IMHO these are not common in normal PySpark workflows), you won't need full GC sweeps at all (data-related objects should be collected once the data is spilled / pickled).
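To make the difference concrete, here is a minimal sketch, assuming CPython. The `Node` class is a hypothetical placeholder, not anything from PySpark. It shows that an object without cycles is freed as soon as its last reference goes away, while a reference cycle survives until the cycle collector runs:

```python
import gc
import weakref


class Node:
    """Hypothetical placeholder object, used only for this illustration."""


# Disable automatic cycle collection so only the explicit collect() call
# below can reclaim cycles. Reference counting keeps working regardless.
gc.disable()

# Case 1: no cycle. Dropping the last reference frees the object at once.
a = Node()
a_ref = weakref.ref(a)
del a
freed_by_refcount = a_ref() is None

# Case 2: a reference cycle. The counts never reach zero on their own.
b, c = Node(), Node()
b.other, c.other = c, b
b_ref = weakref.ref(b)
del b, c
alive_before_collect = b_ref() is not None

gc.collect()  # the cycle collector finds and frees the unreachable pair
freed_by_collect = b_ref() is None

gc.enable()
print(freed_by_refcount, alive_before_collect, freed_by_collect)
```

On CPython this prints `True True True`. Other Python implementations may behave differently, since immediate reclamation is a CPython implementation detail.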

Spark is also known to kill idle Python workers, even if you enable the reuse option, so quite often a worker skips GC completely.

You can control CPython's garbage collection behavior using the gc module's set_threshold function:

gc.set_threshold(threshold0[, threshold1[, threshold2]])

or trigger a GC sweep manually with collect:

gc.collect(generation=2)

but in my experience most of the GC problems in PySpark come from the JVM side, not Python.
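For reference, a small sketch of both calls in use. Note that the thresholds are counts of container-object allocations minus deallocations, not bytes, so they cannot express a rule like "collect when memory is almost full". The value 100 is an arbitrary illustration rather than a recommendation, and where you would call this inside a PySpark job is left as an assumption:

```python
import gc

# Read the current thresholds instead of assuming the defaults, which
# can differ between CPython versions.
original = gc.get_threshold()

# Lower threshold0 so generation-0 collections start after fewer net
# container allocations. 100 is an arbitrary example value.
gc.set_threshold(100, original[1], original[2])
lowered = gc.get_threshold()

# Build a small reference cycle that only the cycle collector can free.
left, right = [], []
left.append(right)
right.append(left)
del left, right

# Force a full collection of all generations. The return value is the
# number of unreachable objects the collector found.
unreachable = gc.collect(generation=2)

# Restore the previous settings so the rest of the process is unaffected.
gc.set_threshold(*original)

print(lowered[0], unreachable)
```

If you do want a manual sweep in a worker, one option is to call `gc.collect()` at the end of the function you pass to something like `mapPartitions`, but measure first: with few cycles there is usually little for it to free.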