I created a DataFrame named df in pyspark using a HiveContext (not a SQLContext).
But I found that after calling df.cache(), I can no longer call df.show(). For example:
>>> df.show(2)
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
| bits| dst_ip|dst_port|flow_direction|in_iface|ip_dscp|out_iface| pkts|protocol| src_ip|src_port| tag|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
|16062594|42.120.84.166| 11291| 1| 3| 36| 2|17606406| pnni|42.120.84.115| 14166|10008|
|13914480|42.120.82.254| 13667| 0| 4| 32| 1|13953516| ax.25| 42.120.86.49| 19810|10002|
+--------+-------------+--------+--------------+--------+-------+---------+--------+--------+-------------+--------+-----+
only showing top 2 rows
>>>
>>> df.cache()
DataFrame[bits: bigint, dst_ip: string, dst_port: bigint, flow_direction: string, in_iface: bigint, ip_dscp: string, out_iface: bigint, pkts: bigint, protocol: string, src_ip: string, src_port: bigint, tag: string]
>>> df.show(2)
16/05/16 15:59:32 ERROR Executor: Exception in task 0.0 in stage 14.0 (TID 14)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<stdin>", line 1, in <lambda>
IndexError: list index out of range
However, after calling df.unpersist(), df.show() runs again.
I don't understand this. I thought df.cache() simply caches the RDD for later use, so why does df.show() stop working after the cache call?
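To summarize, the sequence is (sqlContext is the HiveContext mentioned above, and df comes from whatever transformation originally produced it):

>>> df.show(2)      # works
>>> df.cache()      # mark the DataFrame for in-memory caching
>>> df.show(2)      # fails with the IndexError above
>>> df.unpersist()  # drop the cached data
>>> df.show(2)      # works again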
Answer 0 (score: 0)
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
Caching Data In Memory
Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Spark SQL will then scan only the required columns and automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.
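For illustration, here is a short sketch of those same calls in pyspark (the table name "flows" is made up; sqlContext is the HiveContext from the question, and the config keys are from the same docs page):

>>> df.registerTempTable("flows")   # expose df as a SQL table
>>> sqlContext.cacheTable("flows")  # cache in the in-memory columnar format (same effect as df.cache())
>>> sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
>>> sqlContext.sql("SET spark.sql.inMemoryColumnarStorage.batchSize=10000")
>>> sqlContext.uncacheTable("flows")  # remove the table from memory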