Efficiently store a PySpark dataframe, or use a Dask dataframe

Date: 2019-06-11 15:11:56

Tags: pyspark

I am trying to store the output of association rules, produced by PySpark's FPGrowth on a large dataset, in a pandas dataframe.

This is the code I am trying:

from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# transactions2 is a list of item lists built earlier in the notebook
R = Row('ID', 'items')
df = spark.createDataFrame([R(i, x) for i, x in enumerate(transactions2)])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.6, minConfidence=0.6)
model = fpGrowth.fit(df)

# associationRules is a Spark DataFrame; toPandas() collects all of it
# into the single driver process
ar = model.associationRules
association_rules = ar.toPandas()

The error I get is:

  Py4JJavaError: An error occurred while calling o115.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 11 in stage 11.0 failed 1 times, most recent failure: Lost task 
11.0 in stage 11.0 (TID 275, localhost, executor driver): 
java.lang.OutOfMemoryError: GC overhead limit exceeded


  ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host

Specs of the system used:

Windows 10 virtual machine with 24 vCPUs and 240 GB RAM, Python 3.6.5, Jupyter Notebook, PySpark 2.4.3, Java version: javac 1.8.0_211
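
The failing call in the trace, o115.collectToPython, is the driver-side collect behind toPandas(): every rule is pulled into the one driver JVM, whose default heap in a local Jupyter session is far smaller than the machine's 240 GB. A minimal sketch of raising it; the 64g figure is an assumed value to tune, and the environment variable must be set before the first SparkContext is created:

import os

# Driver heap size is fixed when the JVM launches, so set it before the
# first SparkContext exists (e.g. in the notebook's first cell).
# 64g is an assumed value; tune it to the workload and the 240 GB machine.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 64g pyspark-shell"

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)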

I have not been able to get past the error, and I need to store the output in a .csv file.
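
One way to sidestep the collect entirely is to let Spark write the CSV itself, in parallel, straight from the executors. A minimal sketch, assuming the ar dataframe from the code above and a hypothetical output path; note that the CSV writer rejects array columns, so the item arrays are joined into plain strings first:

from pyspark.sql.functions import concat_ws

# The CSV data source cannot serialize array columns, so flatten them
ar_flat = (ar
           .withColumn("antecedent", concat_ws(",", "antecedent"))
           .withColumn("consequent", concat_ws(",", "consequent")))

# Distributed write: one part-*.csv file per partition under the directory
ar_flat.write.option("header", True).csv("association_rules_csv")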

The output dataframe I want to store as CSV is:

+--------------------+----------+----------+----+
|          antecedent|consequent|confidence|lift|
+--------------------+----------+----------+----+
|        [B, E, N,...|       [A]|       1.0| 1.0|
|        [B, C, D,...|       [A]|       1.0| 1.0|
|        [E, F, G,...|       [B]|       1.0| 1.0|
|        [A, B, M,...|       [C]|       1.0| 1.0|
+--------------------+----------+----------+----+
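
As for the Dask route from the title, a hedged sketch: have Spark write the flattened rules to Parquet, then read them with dask.dataframe, which loads partitions lazily instead of holding everything in one process (the paths and the ar_flat name carry over from the sketch above, and a Parquet engine such as pyarrow or fastparquet is assumed to be installed):

import dask.dataframe as dd

# Parquet preserves column types and round-trips Spark output cleanly
ar_flat.write.parquet("association_rules_parquet")

# Dask reads the directory of part files lazily, partition by partition
rules = dd.read_parquet("association_rules_parquet")

# A glob pattern in the name writes one CSV per partition
rules.to_csv("association_rules-*.csv", index=False)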

0 Answers