Spark - processing multiple CSV files is very slow, despite a powerful cluster

Asked: 2015-11-13 21:12:37

Tags: python csv apache-spark apache-spark-sql pyspark

I want to process data with Spark on an Azure HDI cluster (16 nodes, 96 cores). The data consists of text files in blob storage (roughly 1000 files, each about 100 KB-10 MB). Although the cluster should be more than sufficient, processing a few GB of data takes a very long time (even a simple count). What am I doing wrong? If I save the DataFrame to Parquet, will it be distributed across the nodes?

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.info('start')

sc = SparkContext('spark://headnodehost:7077', 'pyspark')
sqlContext = SQLContext(sc)
logging.info('context: %s', sc)

def get_df(path, sep='\t', has_header=True):
    # Read all matching text files as one RDD, requesting 50 partitions.
    rdd = sc.textFile(path, 50)
    # Split each line into string fields.
    rddsplit = rdd.map(lambda x: [str(xx) for xx in x.split(sep)], preservesPartitioning=True)
    if has_header:
        # Use the first row as the schema and filter header rows out of the data.
        header = rddsplit.first()
        logging.info(header)
        schema = StructType([StructField(h, StringType(), True) for h in header])
        rdd_no_header = rddsplit.filter(lambda x: x != header)
        df = sqlContext.createDataFrame(rdd_no_header, schema).persist()
    else:
        df = sqlContext.createDataFrame(rddsplit).persist()
    return df

path = r"wasb://blob_name@storage_name.blob.core.windows.net/path/*/*.tsv"
df = get_df(path)
logging.info('type=%s', type(df))
# count() returns an int, so log it instead of calling .show() on it.
logging.info('count=%d', df.count())
alert_count = df.groupBy('ColumnName').count()
alert_count.show()
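For context on the Parquet question above, this is roughly what I mean by saving the DataFrame; it is only a sketch, and the output path below is a placeholder rather than my real container layout:

# Sketch only: write the persisted DataFrame back to blob storage as Parquet.
# DataFrame.write.parquet is available from Spark 1.4; on older versions the
# equivalent is df.saveAsParquetFile(out_path).
out_path = r"wasb://blob_name@storage_name.blob.core.windows.net/path/output_parquet"
df.write.parquet(out_path)
# Reading it back later would be:
# df2 = sqlContext.read.parquet(out_path)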

Thanks, Hanan

0 Answers:

There are no answers yet.