
Should it take 4 seconds to count 13M lines?

Time: 2015-04-02 22:24:48

Tags: apache-spark pyspark

I'm new to Spark and am currently testing pySpark. It is slower than I expected, and I wonder whether I have set it up correctly.

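My problem:

I have an RDD made up of 57 partitions (~30 MB each), all of them cached (total size in memory is 1700 MB). The RDD contains 13M strings, each about 300 characters long. So in general it is not a big dataset. Why, then, does running count() take 4 seconds?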

I've checked the UI, and for 'count' it runs 57 tasks (as expected), each taking 0.6 seconds, which seems slow to me.

I'm running on Google Cloud on top of Mesos, with 1 master and 2 slaves. Each instance has 8 cores and 30 GB of RAM.

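My questions:

Does 0.6 seconds per task make sense?
According to the UI, each executor spent 18 seconds running tasks. With 8 cores per node, that should take 2.25 seconds. So how do we end up at 4 seconds?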
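The arithmetic behind the second question can be written out as a rough sketch. It uses only the figures quoted above and assumes, purely for illustration, that every task takes the same 0.6 seconds and that tasks run in whole "waves" across the available cores. The variable names are made up for this example.

```python
import math

# Figures quoted in the question
num_tasks = 57           # one task per partition
task_seconds = 0.6       # approximate duration of one task
executors = 2            # one executor per slave node
cores_per_executor = 8

# Total task time, split evenly across executors and cores
total_task_time = num_tasks * task_seconds              # about 34.2 s
per_executor = total_task_time / executors              # about 17.1 s, close to the 18 s in the UI
ideal_wall_clock = per_executor / cores_per_executor    # about 2.1 s

# Assumption: tasks cannot be split, so 57 tasks on 16 slots need 4 waves
slots = executors * cores_per_executor
waves = math.ceil(num_tasks / slots)
wave_wall_clock = waves * task_seconds                  # about 2.4 s

print('ideal: %.2f s, with waves: %.2f s' % (ideal_wall_clock, wave_wall_clock))
```

Under these assumptions the task time alone accounts for a little over 2 seconds, so the remaining time up to 4 seconds has to come from something this simple model does not capture.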

The code:

import time
GCS_CONNECTOR_HADOOP_CONF = {
    'fs.gs.impl': 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem',
    'fs.gs.project.id': 'xxx',
    'fs.gs.system.bucket': 'xxx',
    'fs.gs.working.dir': 'spark',
    'fs.gs.auth.service.account.email': 'xxx',
    'fs.gs.auth.service.account.keyfile': 'xxxx'
}


def get_rdd_from_gcs_uris(spark_context,
                          gcs_uris,
                          hadoop_conf,
                          input_format='org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
                          key_type='org.apache.hadoop.io.LongWritable',
                          value_type='org.apache.hadoop.io.Text',
                          key_converter=None):

    rdds = []
    for gcs_uri in gcs_uris:

        rdd = spark_context.newAPIHadoopFile(gcs_uri,
                                             input_format,
                                             key_type,
                                             value_type,
                                             keyConverter=key_converter,
                                             conf=hadoop_conf).cache()
        # we only care about the values, the keys are the byte offsets of each value
        rdd = rdd.map(lambda x: x[1])
        rdds.append(rdd)
    return spark_context.union(rdds)

# Reading files from GCS (I'm reading 6 files)
rdd = get_rdd_from_gcs_uris(sc, gcs_uris, GCS_CONNECTOR_HADOOP_CONF).cache()

# Counting lines for the first time. This is supposed to be slow
rdd.count()

# Counting lines for the second time. It's 10x faster than the first time, but it still takes 4 seconds
tic = time.time()
rdd.count()
# time.time() already returns seconds, so no division is needed
print('Count took %.2f seconds' % (time.time() - tic))

1 answer:

Answer 0 (score: 3)

Tips:

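Use Scala (or Java) instead of Python. I have no reference for this, but it seems to be common knowledge that bridging the two languages adds inefficiency. Each executor will run a Python process and communicate with it through a pipe.
Don't union RDDs. You can pass a "glob" (e.g. path/*.csv) to newAPIHadoopFile and it will return an RDD made up of all matching files. (But this should not affect count after caching.)
On the "Storage" tab of the Spark UI, check what fraction of the RDD is cached. Maybe it is not 100%.
Don't measure seconds. Process more data and measure minutes. The JVM can spend 4 seconds just doing GC.
Try using more or fewer partitions. You have 57 partitions and 16 executor threads, so each executor thread has to ask for more work several times. Try 16 partitions, so they only need to ask once.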
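The glob and partition-count tips might be combined roughly as follows. This is a minimal sketch, not a tested replacement for the code in the question: load_lines and timed_count are hypothetical helper names, the glob path and the Hadoop configuration are supplied by the caller, and using coalesce() to reach 16 partitions is one possible choice that may leave the partitions unevenly sized.

```python
import time


def load_lines(sc, gcs_glob, hadoop_conf, num_partitions=16):
    """Read every file matching gcs_glob into one cached RDD of lines.

    sc is an existing SparkContext; load_lines is a hypothetical helper.
    """
    rdd = sc.newAPIHadoopFile(
        gcs_glob,
        'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
        'org.apache.hadoop.io.LongWritable',
        'org.apache.hadoop.io.Text',
        conf=hadoop_conf)
    # Keep only the line text; the keys are byte offsets
    lines = rdd.map(lambda kv: kv[1])
    # Assumption: merge down to one partition per executor thread.
    # coalesce() avoids a full shuffle but cannot add partitions.
    return lines.coalesce(num_partitions).cache()


def timed_count(rdd):
    """Return (count, elapsed seconds) for a single count() call."""
    tic = time.time()
    n = rdd.count()
    return n, time.time() - tic
```

A call would look like load_lines(sc, 'path/*.csv', GCS_CONNECTOR_HADOOP_CONF), followed by one warm-up count() to fill the cache and then timed_count(rdd) for the measurement.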