Why can't I call show after caching in Spark SQL?

Asked: 2016-05-16 08:10:52

Tags: pyspark apache-spark-sql

I created a DataFrame named df in pyspark using HiveContext (rather than SQLContext).

But I found that after calling df.cache() I can no longer call df.show(). For example:

>>> df.show(2)
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
|    bits|       dst_ip|dst_port|flow_direction|in_iface|ip_dscp|out_iface|    pkts|protocol|       src_ip|src_port|  tag|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
|16062594|42.120.84.166|   11291|             1|       3|     36|        2|17606406|    pnni|42.120.84.115|   14166|10008|
|13914480|42.120.82.254|   13667|             0|       4|     32|        1|13953516|   ax.25| 42.120.86.49|   19810|10002|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
only showing top 2 rows


>>> 
>>> df.cache()
DataFrame[bits: bigint, dst_ip: string, dst_port: bigint, flow_direction: string, in_iface: bigint, ip_dscp: string, out_iface: bigint, pkts: bigint, protocol: string, src_ip: string, src_port: bigint, tag: string]


>>> df.show(2)
16/05/16 15:59:32 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<stdin>", line 1, in <lambda>
IndexError: list index out of range

But after calling df.unpersist(), df.show() runs again.

I don't understand this, because I thought df.cache() simply caches the RDD for later use. Why does df.show() stop working after cache is called?
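Since the traceback ends in a lambda defined at <stdin>, one plausible reading (an assumption, not confirmed anywhere in this thread) is that df was built from an RDD with a row-parsing lambda, and that show(2) only evaluates the leading records of a partition, while filling the cache forces the whole partition through the lambda, so a malformed record deep in the data surfaces the IndexError. A minimal sketch of that pattern, with made-up names and data, assuming an existing SparkContext sc and HiveContext sqlContext as in the question:

from pyspark.sql import Row

# Hypothetical input: many well-formed records, one truncated record near the end.
lines = ["%d,42.120.84.%d,%d" % (i, i % 250, 10000 + i) for i in range(100000)]
lines.append("truncated_line")

raw = sc.parallelize(lines, 1)            # a single partition keeps the example simple

# Row-parsing lambda: raises IndexError on the truncated record.
rows = raw.map(lambda l: Row(bits=l.split(",")[0],
                             dst_ip=l.split(",")[1],
                             dst_port=l.split(",")[2]))
df = sqlContext.createDataFrame(rows)

df.show(2)     # succeeds here: only the leading records are parsed
df.cache()
df.show(2)     # fails with IndexError: filling the in-memory cache materializes
               # the whole partition, including the malformed record
df.unpersist()
df.show(2)     # succeeds again, matching the behaviour described above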

1 Answer:

Answer 0 (score: 0)

http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory

Caching Data In Memory

Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Spark SQL will then scan only the required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
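For reference, a minimal sketch of those calls in pyspark, assuming the questioner's df and a HiveContext named sqlContext; the temp table name "flows" is made up:

df.registerTempTable("flows")              # "flows" is a hypothetical table name

sqlContext.cacheTable("flows")             # cache via the table name, or equivalently:
df.cache()                                 # cache the DataFrame itself

sqlContext.sql("SELECT dst_ip, pkts FROM flows").show(2)   # scans only the needed columns

sqlContext.uncacheTable("flows")           # remove the cached table from memory
df.unpersist()                             # or drop the DataFrame's cached data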

Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.
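A sketch of both configuration routes; the two property names below are the in-memory columnar caching options documented on the same page, and the values are only illustrative:

# Via setConf on the SQLContext/HiveContext:
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

# Or via a SQL SET command:
sqlContext.sql("SET spark.sql.inMemoryColumnarStorage.batchSize=10000")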

https://forums.databricks.com/questions/6834/cache-table-advanced-before-executing-the-spark-sq.html#answer-6900