When running Dask on a distributed job, I get the following error on the scheduler:
distributed.core - ERROR -
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/distributed/core.py", line 269, in write
frames = protocol.dumps(msg)
File "/usr/local/lib/python3.4/dist-packages/distributed/protocol.py", line 81, in dumps
frames = dumps_msgpack(small)
File "/usr/local/lib/python3.4/dist-packages/distributed/protocol.py", line 153, in dumps_msgpack
payload = msgpack.dumps(msg, use_bin_type=True)
File "/usr/local/lib/python3.4/dist-packages/msgpack/__init__.py", line 47, in packb
return Packer(**kwargs).pack(o)
File "msgpack/_packer.pyx", line 231, in msgpack._packer.Packer.pack (msgpack/_packer.cpp:231)
File "msgpack/_packer.pyx", line 239, in msgpack._packer.Packer.pack (msgpack/_packer.cpp:239)
MemoryError
Did the scheduler or one of the workers run out of memory? Or both?
Answer 0 (score: 2)
The most common cause of this error is trying to collect too much data back to the client, as in the following example using dask.dataframe:
import dask.dataframe as dd

df = dd.read_csv('s3://bucket/lots-of-data-*.csv')
df.compute()
This loads all of the data into RAM across the cluster (which is fine), and then tries to bring the entire result back to the local machine by way of the scheduler, which probably cannot handle your hundreds of GB of data all in one place. Worker-to-client communications pass through the scheduler, so it is the first single machine to receive all of the data and the first machine likely to fail.
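As a rough check before calling compute(), you can estimate on the cluster how much data would have to travel through the scheduler. This is only a sketch, not part of the original answer; it assumes the df from the example above:

# sum the in-memory size of each partition on the workers;
# only a single number is sent back to the client
nbytes = df.map_partitions(lambda part: part.memory_usage(deep=True).sum()).sum().compute()
print("approx. %.1f GB would pass through the scheduler on df.compute()" % (nbytes / 1e9))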
If this is the case, then you probably want to use the Executor.persist method instead, which triggers the computation but leaves the results on the cluster:

df = dd.read_csv('s3://bucket/lots-of-data-*.csv')
df = e.persist(df)

Generally we only use df.compute() for small results that we want to view in our local session.
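For example (a sketch, not from the original answer, assuming the persisted df above), small derived results are still safe to pull back:

preview = df.head()   # returns a small pandas DataFrame; only a few rows pass through the scheduler
nrows = len(df)       # computes the row count on the cluster and sends back a single integer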