Code:
views = sdf \
    .where(sdf['PRODUCT_ID'].isin(PRODUCTS)) \
    .rdd \
    .groupBy(lambda x: x['SESSION_ID']) \
    .toLocalIterator()
for sess_id, rows in views:
    # do something
PRODUCTS is a set. It is large, about 10,000 items.
The code fails with:
--> 9 for sess_id, rows in views:
/usr/local/spark/python/pyspark/rdd.py in _load_from_socket(port, serializer)
--> 142 for item in serializer.load_stream(rf):
/usr/local/spark/python/pyspark/serializers.py in load_stream(self, stream)
--> 139 yield self._read_with_length(stream)
/usr/local/spark/python/pyspark/serializers.py in _read_with_length(self, stream)
--> 156 length = read_int(stream)
/usr/local/spark/python/pyspark/serializers.py in read_int(stream)
--> 543 length = stream.read(4)
/opt/conda/lib/python3.5/socket.py in readinto(self, b)
574 try:
--> 575 return self._sock.recv_into(b)
576 except timeout:
577 self._timeout_occurred = True
timeout: timed out
But when I make PRODUCTS smaller, everything works fine. I tried changing some timeout values in the Spark configuration, but it didn't help. How can I avoid this crash?
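For reference, here is a minimal sketch of the kind of timeout settings one might try (these config keys are assumed examples, not necessarily the ones changed here). As the answer below explains, they do not affect the hard-coded socket timeout used by toLocalIterator():

# Assumed example: raising general Spark network timeouts.
# These do NOT change the local-iterator socket timeout in rdd.py.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config('spark.network.timeout', '600s') \
    .config('spark.executor.heartbeatInterval', '60s') \
    .getOrCreate()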
Update
PRODUCTS = sdf.sort(['TIMESTAMP']).select('PRODUCT_ID').limit(10000).drop_duplicates()
views = sdf \
    .join(PRODUCTS, 'PRODUCT_ID', 'inner') \
    .rdd \
    .groupBy(lambda x: x['SESSION_ID']) \
    .toLocalIterator()
for sess_id, rows in views:
    # do ...
Now PRODUCTS is a DataFrame and I use a join instead. I get the same error...
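A side note (not a fix for the timeout itself): since PRODUCTS has at most 10,000 rows, a broadcast hint keeps the join cheap by avoiding a shuffle of sdf. A sketch, assuming the same names as above:

from pyspark.sql.functions import broadcast

# Broadcast the small PRODUCTS DataFrame so each executor joins locally.
views = sdf \
    .join(broadcast(PRODUCTS), 'PRODUCT_ID', 'inner') \
    .rdd \
    .groupBy(lambda x: x['SESSION_ID']) \
    .toLocalIterator()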
Update 2
I tried this approach:
views = sdf \
    .join(PRODUCTS, 'PRODUCT_ID', 'inner') \
    .rdd \
    .groupBy(lambda x: x['SESSION_ID'])
views.cache()
for sess_id, rows in views.toLocalIterator():
    pass
After a while a long error appeared:
Py4JJavaError: An error occurred while calling o289.javaToPython.
: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:194)
....
This error appeared only once! Since then I get the same timeout exception again!
Answer 0 (score: 1)
I think this is mainly because of a bug in the implementation of toLocalIterator() in pyspark 2.0.2. You can read more about it here: [SPARK-18281][SQL][PySpark] Remove timeout for reading data through socket for local iterator. The fix appears to be coming in the next update after 2.0.2 and in the 2.1.x releases. If you want to fix it yourself temporarily, you can apply the change from the issue above:
At line 138 of rdd.py, replace this (on an actual Spark cluster it seems you need to update rdd.py inside pyspark.zip):
try:
    rf = sock.makefile("rb", 65536)
    for item in serializer.load_stream(rf):
        yield item
finally:
    sock.close()
with this:
sock.settimeout(None)  # << this is the key line that disables the timeout after the initial connection
return serializer.load_stream(sock.makefile("rb", 65536))
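If patching rdd.py inside pyspark.zip on the cluster is impractical, a workaround is to materialize the result instead of streaming it. This is only a sketch and assumes the grouped data fits in driver memory:

# Sketch: collect() runs a single job and avoids the long-lived
# local-iterator socket that hits the hard-coded timeout.
grouped = sdf \
    .join(PRODUCTS, 'PRODUCT_ID', 'inner') \
    .rdd \
    .groupBy(lambda x: x['SESSION_ID']) \
    .collect()

for sess_id, rows in grouped:
    pass  # process one session's rows here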
Answer 1 (score: 0)
As @eliasah said in the comments, you should try joining the two DataFrames to exclude the rows whose PRODUCT_ID is not in the PRODUCTS table.
views = sdf \
    .join(PRODUCTS, on='PRODUCT_ID', how='inner') \
    .rdd \
    .groupBy(lambda x: x['SESSION_ID']) \
    .toLocalIterator()
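Another option, sketched here under the assumption that PRODUCTS is still available on the driver as a plain Python set and that sc is the SparkContext: ship the set to the executors as a broadcast variable and filter on the RDD side, which keeps the 10,000 values out of the SQL query plan. Note that this still ends in toLocalIterator(), so the socket timeout from the question can still occur until the fix above is applied.

# Sketch: filter with a broadcast variable instead of a large isin() literal.
bc_products = sc.broadcast(PRODUCTS)  # PRODUCTS assumed to be a Python set

views = sdf.rdd \
    .filter(lambda x: x['PRODUCT_ID'] in bc_products.value) \
    .groupBy(lambda x: x['SESSION_ID']) \
    .toLocalIterator()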