I am trying to store the association rules produced by FPGrowth (PySpark, on a large dataset) in a pandas DataFrame.
This is the code I am trying:
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

# transactions2 is my transaction data: one list of items per transaction
R = Row('ID', 'items')
df = spark.createDataFrame([R(i, x) for i, x in enumerate(transactions2)])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.6, minConfidence=0.6)
model = fpGrowth.fit(df)

ar = model.associationRules
association_rules = ar.toPandas()  # this is the call that fails
The error I get is:
Py4JJavaError: An error occurred while calling o115.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 11 in stage 11.0 failed 1 times, most recent failure: Lost task
11.0 in stage 11.0 (TID 275, localhost, executor driver):
java.lang.OutOfMemoryError: GC overhead limit exceeded
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
Specifications of the system used:
Windows 10 virtual machine with 24 vCPUs and 240 GB RAM, Python 3.6.5, Jupyter Notebook, PySpark 2.4.3, Java version: javac 1.8.0_211.
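Since the failure is a driver-side OutOfMemoryError, one variable I can control is the driver memory. A minimal sketch of allocating more of it when the session is created (the values below are assumptions based on the 240 GB machine, and spark.driver.memory only takes effect if it is set before the JVM starts, i.e. in a fresh kernel):

from pyspark.sql import SparkSession

# Sketch only: the values are assumptions, not tested settings.
# spark.driver.memory must be set before any SparkContext exists.
spark = (SparkSession.builder
         .config("spark.driver.memory", "200g")      # assumed value for a 240 GB VM
         .config("spark.driver.maxResultSize", "0")  # 0 = no limit on collected results
         .getOrCreate())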
I have not been able to resolve the issue, and I need to store the output in a .csv file.
The output DataFrame that I want to store as CSV is:
+--------------------+------------+------------------+------------------+
|          antecedent|  consequent|        confidence|              lift|
+--------------------+------------+------------------+------------------+
|       [B, E, N, ...|         [A]|               1.0|               1.0|
|       [B, C, D, ...|         [A]|               1.0|               1.0|
|       [E, F, G, ...|         [B]|               1.0|               1.0|
|       [A, B, M, ...|         [C]|               1.0|               1.0|
+--------------------+------------+------------------+------------------+
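One direction I am considering, instead of collecting everything to pandas with toPandas(), is letting Spark write the CSV itself so the rules never have to fit on the driver. A minimal sketch (the output directory "rules_csv" is a placeholder; concat_ws is needed because Spark's CSV writer cannot serialize the array-typed antecedent/consequent columns):

from pyspark.sql.functions import concat_ws

# Flatten the array columns to plain strings so the CSV writer accepts them.
ar_flat = (ar
           .withColumn("antecedent", concat_ws(",", "antecedent"))
           .withColumn("consequent", concat_ws(",", "consequent")))

# coalesce(1) is optional; it just forces a single part file in the output folder.
ar_flat.coalesce(1).write.mode("overwrite").option("header", True).csv("rules_csv")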