Question

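I have a two-machine EMR cluster with PySpark installed that reads data from S3. The code is a very simple filter-and-transform operation that uses sqlContext.readStream.text to get the data from the bucket. The bucket is about 10 TB and holds roughly 75k files laid out as bucket/year/month/day/hour/*, where * stands for up to 20 files of 128 MB each. I start the streaming job by passing s3://bucket_name/dir/ and letting PySpark read every file under it. It has now been almost 2 hours and the job has not even started consuming data from S3; the network traffic reported by Ganglia is minimal.

I don't understand why this is so slow or how to speed it up, because right now the machines I'm paying for are basically idle.

When I track the state with .status and .lastProgress, I get the following responses, respectively: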

{'isDataAvailable': False,
 'isTriggerActive': True,
 'message': 'Getting offsets from FileStreamSource[s3://bucket_name/dir]'}

and

{'durationMs': {'getOffset': 207343, 'triggerExecution': 207343},
 'id': '******-****-****-****-*******',
 'inputRowsPerSecond': 0.0,
 'name': None,
 'numInputRows': 0,
 'processedRowsPerSecond': 0.0,
 'runId': '******-****-****-****-*******',
 'sink': {'description': 'FileSink[s3://dest_bucket_name/results/file_name.csv]'},
 'sources': [{'description': 'FileStreamSource[s3://bucket_name/dir]',
   'endOffset': None,
   'inputRowsPerSecond': 0.0,
   'numInputRows': 0,
   'processedRowsPerSecond': 0.0,
   'startOffset': None}],
 'stateOperators': [],
 'timestamp': '2018-02-19T22:31:13.385Z'}

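Any ideas what could be causing the data consumption to take so long? Is this normal behavior? Am I doing something wrong? Any tips on how to improve this process?

Any help is greatly appreciated. Thanks.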

Answer 1

Spark checks the files in the source folder and tries to discover partitions by checking whether the sub-folder names match the pattern "column-name=column-value".
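To make the naming pattern concrete, here is a minimal sketch in plain Python of what "column-name=column-value" matching means for a path. It is only an illustration, not Spark's actual partition-discovery code; the function name parse_partition_values and the example file names are hypothetical.

```python
def parse_partition_values(path):
    """Return {column: value} for each 'column=value' directory segment in path.

    Illustration only: this mimics the naming pattern, not Spark's real logic.
    """
    values = {}
    # Drop the last segment: it is the file name, not a directory.
    for segment in path.strip("/").split("/")[:-1]:
        name, sep, value = segment.partition("=")
        # Keep only segments that really look like 'column=value'.
        if sep and name and value:
            values[name] = value
    return values


# Directory names that follow the pattern yield partition columns.
print(parse_partition_values(
    "s3://bucket_name/dir/year=2018/month=02/day=19/hour=22/part-0000.txt"))

# The plain year/month/day/hour layout from the question yields none.
print(parse_partition_values(
    "s3://bucket_name/dir/2018/02/19/22/part-0000.txt"))
```

With the layout described in the question, none of the directory names match the pattern, so no partition columns can be derived from the path.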

Since your data is partitioned by date, the directories should be named following that pattern, i.e. a layout such as s3://bucket_name/dir/year=YYYY/month=MM/day=DD/hour=HH/* rather than s3://bucket_name/dir/YYYY/MM/DD/HH/*.
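As a rough sketch of how the existing year/month/day/hour keys could be mapped onto key=value directory names, the helper below rewrites an object key as a string. The function name to_partitioned_key, the column names, and the "dir/" prefix default are assumptions for illustration; it only computes the new key and does not touch S3.

```python
# Assumed partition column names, in the order the directories appear.
PARTITION_COLUMNS = ("year", "month", "day", "hour")


def to_partitioned_key(key, prefix="dir/"):
    """Map 'dir/2018/02/19/22/file' to 'dir/year=2018/month=02/day=19/hour=22/file'."""
    if not key.startswith(prefix):
        raise ValueError("key does not start with prefix: %r" % key)
    parts = key[len(prefix):].split("/")
    # Expect exactly one directory level per partition column plus the file name.
    if len(parts) != len(PARTITION_COLUMNS) + 1:
        raise ValueError("unexpected directory depth: %r" % key)
    *values, filename = parts
    dirs = ["%s=%s" % (col, val) for col, val in zip(PARTITION_COLUMNS, values)]
    return prefix + "/".join(dirs + [filename])


print(to_partitioned_key("dir/2018/02/19/22/part-0000.txt"))
```

The actual move or copy of each object to its new key would have to be done separately with whatever S3 tooling you already use; this sketch only shows the naming.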

EMR PySpark structured streaming takes a long time to read from a large S3 bucket