Follow-up to: How to limit FPGrowth itemesets to just 2 or 3. I am trying to export the association rules output of FPGrowth to a .csv file in Python using PySpark. After running for almost 8-10 hours it throws an error. My machine has enough space and memory.
The association rule output looks like this:
```
Antecedent    Consequent    Lift
['A','B']     ['C']         1
```
The code is in the linked question: How to limit FPGrowth itemesets to just 2 or 3. I only added these two lines:
```
ar = ar.coalesce(24)
ar.write.csv('/output', header=True)
```
Configuration used:
```
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = SparkConf().setAppName("App")
conf = (conf.setMaster('local[*]')
        .set('spark.executor.memory', '200G')
        .set('spark.driver.memory', '700G')
        .set('spark.driver.maxResultSize', '400G'))  #8,45,10
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
```
This keeps running and has already consumed 1000 GB of the C:/ drive.
Is there an efficient way to save the output in .CSV or .XLSX format?

The error is:
```
Py4JJavaError: An error occurred while calling o207.csv.
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:664)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 9.0 failed 1 times, most recent failure: Lost task 10.0 in stage 9.0 (TID 226, localhost, executor driver): java.io.IOException: There is not enough space on the disk
at java.io.FileOutputStream.writeBytes(Native Method)
```

The progress:
```
19/07/15 14:12:32 WARN TaskSetManager: Stage 1 contains a task of very large size (26033 KB). The maximum recommended task size is 100 KB.
19/07/15 14:12:33 WARN TaskSetManager: Stage 2 contains a task of very large size (26033 KB). The maximum recommended task size is 100 KB.
19/07/15 14:12:38 WARN TaskSetManager: Stage 4 contains a task of very large size (26033 KB). The maximum recommended task size is 100 KB.
[Stage 5:> (0 + 24) / 24][Stage 6:> (0 + 0) / 24][I 14:14:02.723 NotebookApp] Saving file at /app1.ipynb
[Stage 5:==> (4 + 20) / 24][Stage 6:===> (4 + 4) / 24]
```
Answer 0 (score: 1)
As already mentioned in the comments, you should try to avoid toPandas(), because that function loads all of the data onto the driver. You can write the data out with PySpark's DataFrameWriter, but since it does not support array columns, you have to convert the array columns (antecedent and consequent) to another format before you can write the data to csv. One way to convert the columns to a supported type such as string is concat_ws.
For example:
```
import pyspark.sql.functions as F
from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

ar = model.associationRules \
    .withColumn('antecedent', F.concat_ws('-', F.col("antecedent").cast("array<string>"))) \
    .withColumn('consequent', F.concat_ws('-', F.col("consequent").cast("array<string>")))
ar.show()
```
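With the array columns converted to strings, you can now write the data to csv. A minimal sketch of the write step, reusing the '/output' path and header option from the question; coalesce(1) is only an illustration (one partition means one file, but also a single writer task):
```
# Optionally reduce the number of partitions so fewer part files are written.
ar = ar.coalesce(1)

# Write the converted rules to a directory of CSV part files.
ar.write.csv('/output', header=True, mode='overwrite')
```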
This creates one csv file per partition, so you can control the number of output files by changing the number of partitions before calling ar.write. If Spark still cannot write the csv files because of memory problems, try a different number of partitions and, if necessary, merge the resulting part files with another tool.
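If a single .csv or .xlsx file is required in the end, one way to merge the part files outside Spark is a small pandas script like the sketch below. The paths and file names are assumptions, and it only works if the merged result fits in local memory; writing .xlsx additionally needs openpyxl installed:
```
import glob
import pandas as pd

# Collect the CSV part files Spark wrote to the output directory.
parts = sorted(glob.glob('/output/part-*.csv'))

# Concatenate them into a single DataFrame.
merged = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)

# Save as a single CSV, or as .xlsx if that format is needed.
merged.to_csv('association_rules.csv', index=False)
merged.to_excel('association_rules.xlsx', index=False)
```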