PySpark writing df to csv, "list index out of range"

Asked: 2019-11-15 18:27:09

Tags: python csv pyspark

I followed all the advice for writing a dataframe to csv from this thread: How to export a table dataframe in PySpark to csv?

Every time I run into the same error. The error is also triggered when I simply try to collect the unique values in a column with r.select('movie_id').distinct().collect(), and likewise when running r.toPandas(). In case it's useful information, this is running on Google Colab.
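
For reference, a minimal sketch of the three calls that all raise the error (r is the DataFrame built further down):

r.write.csv('processed_reviews3.csv')       # writing out the DataFrame
r.select('movie_id').distinct().collect()   # collecting unique column values
r.toPandas()                                # converting to pandas

All three are actions, i.e. they force Spark to actually evaluate the lazily-built lineage, which is why an error from an upstream map only surfaces at these points.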

In any case, I always end up with the same error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-106-a30070d47875> in <module>()
----> 1 r.write.csv('processed_reviews3.csv')

/content/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o1331.csv.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:664)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 57.0 failed 1 times, most recent failure: Lost task 4.0 in stage 57.0 (TID 123, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/content/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/content/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/content/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/content/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-68-4d5e289af82d>", line 1, in <lambda>
  File "<ipython-input-67-979e14fa72fd>", line 4, in parse_review
  File "<ipython-input-67-979e14fa72fd>", line 4, in <listcomp>
IndexError: list index out of range

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:244)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
    ... 10 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
    ... 33 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/content/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
    process()
  File "/content/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/content/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 393, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/content/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/util.py", line 99, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-68-4d5e289af82d>", line 1, in <lambda>
  File "<ipython-input-67-979e14fa72fd>", line 4, in parse_review
  File "<ipython-input-67-979e14fa72fd>", line 4, in <listcomp>
IndexError: list index out of range

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:592)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:575)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:244)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
    ... 10 more

I'm not sure what the "list index out of range" refers to here, or how to fix it.

EDIT: adding the user-defined function and additional information for context.

The raw data comes from a text file and looks like this:

data = sc.textFile("/content/gdrive/Shared drives/Team_561/data/movies.txt")
data.take(18)

>>>>

['product/productId: B003AI2VGA',
 'review/userId: A141HP4LYPWMSR',
 'review/profileName: Brian E. Erland "Rainbow Sphinx"',
 'review/helpfulness: 7/7',
 'review/score: 3.0',
 'review/time: 1182729600',
 'review/summary: "There Is So Much Darkness Now ~ Come For The Miracle"',
 'review/text: Synopsis: On the daily trek from Juarez, Mexico to El Paso, Texas an ever increasing number of female workers are found raped and murdered in the surrounding desert. Investigative reporter Karina Danes (Minnie Driver) arrives from Los Angeles to pursue the story and angers both the local police and the factory owners who employee the undocumented aliens with her pointed questions and relentless quest for the truth.<br /><br />Her story goes nationwide when a young girl named Mariela (Ana Claudia Talancon) survives a vicious attack and walks out of the desert crediting the Blessed Virgin for her rescue. Her story is further enhanced when the "Wounds of Christ" (stigmata) appear in her palms. She also claims to have received a message of hope for the Virgin Mary and soon a fanatical movement forms around her to fight against the evil that holds such a stranglehold on the area.<br /><br />Critique: Possessing a lifelong fascination with such esoteric matters as Catholic mysticism, miracles and the mysterious appearance of the stigmata, I was immediately attracted to the \'05 DVD release `Virgin of Juarez\'. The film offers a rather unique storyline blending current socio-political concerns, the constant flow of Mexican migrant workers back and forth across the U.S./Mexican border and the traditional Catholic beliefs of the Hispanic population. I must say I was quite surprised by the unexpected route taken by the plot and the means and methods by which the heavenly message unfolds.<br /><br />`Virgin of Juarez\' is not a film that you would care to watch over and over again, but it was interesting enough to merit at least one viewing. Minnie Driver delivers a solid performance and Ana Claudia Talancon is perfect as the fragile and innocent visionary Mariela. Also starring Esai Morales and Angus Macfadyen (Braveheart).',
 '',
 'product/productId: B003AI2VGA',
 'review/userId: A328S9RN3U5M68',
 'review/profileName: Grady Harp',
 'review/helpfulness: 4/4',
 'review/score: 3.0',
 'review/time: 1181952000',
 'review/summary: Worthwhile and Important Story Hampered by Poor Script and Production',
 "review/text: THE VIRGIN OF JUAREZ is based on true events surrounding the crime problems of Juarez, Mexico reflected in the gringo exploitation of businesses in neighboring El Paso, Texas.  The story contains many important facts that desperately need to be brought into the light, but the impact of the film falters because of the choices made by the writer and director.<br /><br />Karina Danes (Minnie Driver) is a journalist for a Los Angeles newspaper who has flown to Juarez to investigate the multiple (in the hundreds) killings of young women. The targets for these murders seem to be young women working in the US sponsored sweatshops in Juarez who are picked up at night after work, raped, beaten and killed.  Danes is convinced the Juarez police force is doing nothing and takes on the mission of exposing the tragedies, in part due to her own past issues of being too idle with similar crimes in the US.  She meets Father Herrera (Esai Morales) and a community activist Patrick (Angus MacFadyen) and together they probe the police files and follow the most recent murder, discovering along the way a survivor named Mariela (Ana Claudia Talanc&oacute;n), a frightened young girl whose memory of her rape and beating is erased by her apparent vision of the Virgin Mary.  A father of one of the victims, Isidro (Jorge Cervera, Jr.) nurtures Mariela and helps her to escape the hospital, placing her in a 'church' where she becomes a 'saint' to the people of Juarez who long for the crimes to end.  Mariela appears to the public with the stigmata of bleeding hands and offers hope to the victims' families.  Danes works hard to discover evidence that will expose the perpetrators, taking a sheet of photos of 'most wanted men' from the police office of Detective Lauro (Jacob Vargas), and works with the police and Father Herrera to resolve the tragic chain of events that continue in Juarez. Fearing for Mariela's life, they transport her to Los Angeles where mysterious events end the story.<br /><br />The squeaky, mawkish script was written by Michael Fallon and directed by Kevin James Dobson.  Had their vision been more directed toward defining the line between realism and fanaticism, the story would possibly have been better related.  There are some good performances by Driver, Talanc&oacute;n, Morales, and Vargas but the minor roles vary in quality.  Reporting atrocities such as the one this film addresses is a valid and valuable contribution of contemporary cinema.  It is sad when script and the production dull the impact.  Grady Harp, June 07",
 '']

Since each record spans multiple lines, I needed to delimit records on the recurring '\nproduct'.

parsed = sc.newAPIHadoopFile(
    '/content/gdrive/Shared drives/Team_561/data/movies.txt',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\nproduct'}
)
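
Note that newAPIHadoopFile yields (key, value) pairs: the key is the record's byte offset (LongWritable) and the value is the full multi-line record (Text). That is why the record text is picked out with l[1] in the map below. For example:

offset, record = parsed.first()
print(record.split('\n')[0])   # first attribute line of the first record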

Next, I created the following function to extract user_id, movie_id, and rating from each element.

def parse_review(rev):
  # line = rev[0][1]
  attrs = [attr for attr in rev.split('\n')]
  values = [attr.split(':')[1] for attr in attrs[:-1]]
  return (values[0], values[1], values[4])

ratings = parsed.map(lambda l: parse_review(l[1]))

col_names = ['user_id','movie_id','rating']
r = ratings.toDF(col_names)

r, a dataframe, looks like this:

+-----------+---------------+------+
|    user_id|       movie_id|rating|
+-----------+---------------+------+
| B003AI2VGA| A141HP4LYPWMSR|   3.0|
| B003AI2VGA| A328S9RN3U5M68|   3.0|
| B003AI2VGA| A1I7QGUDP043DG|   5.0|
| B003AI2VGA| A1M5405JH9THP9|   3.0|
| B003AI2VGA|  ATXL536YX71TR|   3.0|
| B003AI2VGA| A3QYDL5CDNYN66|   2.0|
| B003AI2VGA|  AQJVNDW6YZFQS|   1.0|
| B00006HAXW|  AD4CDZK7D31XP|   5.0|
| B00006HAXW| A3Q4S5DFVPB70D|   5.0|
| B00006HAXW| A2P7UB02HAVEPB|   5.0|
| B00006HAXW| A2TX99AZKDK0V7|   4.0|
| B00006HAXW|  AFC8IKR407HSK|   5.0|
| B00006HAXW| A1FRPGQYQTAOR1|   5.0|
| B00006HAXW| A1RSDE90N6RSZF|   5.0|
| B00006HAXW| A1OUBOGB5970AO|   4.0|
| B00006HAXW| A3NPHQVIY59Y0Y|   5.0|
| B00006HAXW|  AFKMBAY28XO8A|   5.0|
| B00006HAXW|  A66KMXH9V7OGU|   5.0|
| B00006HAXW|  AFJ27ZV9183B8|   5.0|
| B00006HAXW|  AXMKAXC0TR9AW|   5.0|
+-----------+---------------+------+
only showing top 20 rows
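
For what it's worth, the traceback's "line 4, in <listcomp>" points at the values = [...] comprehension in parse_review, which assumes every line of a record contains a ':' — splitting a line without one yields a single-element list, so [1] is out of range. A hedged debugging sketch (a hypothetical variant, not the original code) that flags malformed records instead of crashing:

def parse_review_safe(rev):
  # keep only lines that actually contain a ':' separator
  attrs = [attr for attr in rev.split('\n') if ':' in attr]
  # split on the first ':' only, so colons inside the review text are preserved
  values = [attr.split(':', 1)[1].strip() for attr in attrs]
  if len(values) < 5:
    return None   # malformed record: too few fields to extract values[4]
  return (values[0], values[1], values[4])

# count (and optionally inspect) the records the original parser would crash on
bad = parsed.map(lambda l: l[1]).filter(lambda rev: parse_review_safe(rev) is None)
print(bad.count())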

0 Answers:

No answers yet.