I want to process data with Spark on an Azure HDI cluster (16 nodes, 96 cores). The data is text files in blob storage (several folders, ~1000 files, each about 100k-10M). Although the cluster should be more than sufficient, processing a few GB of data takes a very long time (even a simple count). What am I doing wrong? And if I save the DataFrame to Parquet, will it be distributed across the nodes?
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logging.info('start')
sc = SparkContext('spark://headnodehost:7077', 'pyspark')
sqlContext = SQLContext(sc)
logging.info('context: %s', sc)
def get_df(path, sep='\t', has_header=True):
    # Read the raw text files into an RDD with 50 partitions
    rdd = sc.textFile(path, 50)
    # Split each line into a list of string fields
    rddsplit = rdd.map(lambda x: [str(xx) for xx in x.split(sep)], preservesPartitioning=True)
    if has_header:
        # Use the first line as the column names and drop it from the data
        header = rddsplit.first()
        logging.info(header)
        schema = StructType([StructField(h, StringType(), True) for h in header])
        rdd_no_header = rddsplit.filter(lambda x: x != header)
        df = sqlContext.createDataFrame(rdd_no_header, schema).persist()
    else:
        df = sqlContext.createDataFrame(rddsplit).persist()
    return df
path = r"wasb://blob_name@storage_name.blob.core.windows.net/path/*/*.tsv"
df = get_df(path)
logging.info('type=%s' , type(df))
logging.info('count=%d', df.count())
alert_count = df.groupBy('ColumnName').count()
alert_count.show()
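Regarding the Parquet question above, this is roughly what I was planning to call once the DataFrame is built (a minimal sketch, assuming Spark 1.4 or later; the output path is hypothetical):

# Hypothetical output location in the same storage account
out_path = r"wasb://blob_name@storage_name.blob.core.windows.net/output/alerts.parquet"
# Each partition of the DataFrame is written as its own part file under out_path,
# so I would expect the output to end up spread across the executors rather than in one file.
df.write.parquet(out_path)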
Thanks, Hanan