Am I using Apache Spark correctly?

Asked: 2016-08-19 10:37:36

Tags: apache-spark dataframe pyspark

Following up on this question, but instead of Oracle I am using HDFS. I am running the following computation on an 8 GB plain CSV file. Each time I fetch the results it takes 7 minutes. I have 5 servers, each with 20 GB of memory. How can I reduce the execution time?

# loading data from HDFS
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("dateFormat", "dd/MM/yyyy hh:mm:ss.SSS")
      .option("inferSchema", "true")
      .load("hdfs://10.10.10.11:8020/sparkfiles/alarmfiles/export.csv"))

Converting the date/time column:

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType
from pyspark.sql.types import TimestampType
import re

def datefun(firstoccurrence):
    # strip the last three fractional digits before the " AM"/" PM" marker,
    # then parse the remainder as a timestamp
    return datetime.strptime(re.sub(r'\d{3}( .M)$', r'\1', firstoccurrence),
                             '%d-%b-%y %I.%M.%S.%f %p')

dt_conv = udf(datefun, TimestampType())
groupbyalerts1 = df.groupBy('ALERTGROUP').count().sort('count', ascending=False)

groupbyalerts2 = groupbyalerts1.filter(groupbyalerts1['ALERTGROUP'] != '')

groupbyalerts = groupbyalerts2.filter(groupbyalerts2['ALERTGROUP'] != '0')

groupbyalerts.show(30,False)

Log output during network activity:

INFO TaskSetManager: Finished task 4.0 in stage 18.0 (TID 980) in 4080 ms on analytics1.com (4/54)
16/08/19 09:01:37 INFO TaskSetManager: Starting task 7.0 in stage 18.0 (TID 983, analytics2.com, partition 7
16/08/19 16:23:24 INFO BlockManagerInfo: Added broadcast_86_piece0 in memory on analytics2.com:57096 (size: 25.9 KB, free: 8.4 GB)

Log output during the computation:

16/08/19 16:18:28 INFO TaskSetManager: Starting task 22.0 in stage 57.0 (TID 2563, analytics4.com, partition 22,NODE_LOCAL, 2324 bytes)
16/08/19 16:18:28 INFO TaskSetManager: Finished task 18.0 in stage 57.0 (TID 2559) in 28654 ms on analytics4.com (19/54)
16/08/19 16:18:32 INFO TaskSetManager: Starting task 23.0 in stage 57.0 (TID 2564, nanalytics3.com, partition 23,NODE_LOCAL, 2324 bytes)
16/08/19 16:18:32 INFO TaskSetManager: Finished task 20.0 in stage 57.0 (TID 2560) in 31702 ms on analytics3.com (20/54)

1 Answer:

Answer 0 (score: 1):

First of all, try to avoid schema inference. With the default settings it requires an additional scan over the data and is more expensive than explicit casting. In other words, provide a schema to the reader:

schema = StructType([...])
df = sqlContext.read.schema(schema).format(...)
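
As a minimal sketch, assuming export.csv actually contains the ALERTGROUP and FIRSTOCCURRENCE columns referenced in the question (all other column names below are placeholders that have to be replaced with the real header):

from pyspark.sql.types import StructType, StructField, StringType

# hypothetical column list -- replace with the actual columns of export.csv
schema = StructType([
    StructField("ALERTGROUP", StringType(), True),
    StructField("FIRSTOCCURRENCE", StringType(), True),
    # ... remaining columns ...
])

df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(schema)
      .load("hdfs://10.10.10.11:8020/sparkfiles/alarmfiles/export.csv"))

With an explicit schema the inferSchema option, and the extra pass over the data it triggers, is no longer needed.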

Likewise, it is best to avoid Python UDFs. Although UDF performance has improved in 2.0+, they are still suboptimal. In this case you can easily replace the UDF with a combination of:

  • pyspark.sql.functions.regexp_replace
  • pyspark.sql.functions.unix_timestamp, which can parse the string into a Long given a simple date format
  • and finally a plain type cast to a timestamp/date (see the sketch below)
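
A rough sketch of that replacement, assuming the raw column is named FIRSTOCCURRENCE (taken from the UDF argument above) and that its values match the '%d-%b-%y %I.%M.%S.%f %p' pattern the UDF parses; the output column name is likewise made up:

from pyspark.sql.functions import regexp_replace, unix_timestamp

df_ts = df.withColumn(
    "FIRSTOCCURRENCE_TS",                                  # hypothetical output column
    unix_timestamp(
        # drop the last three fractional digits, mirroring the UDF's re.sub
        regexp_replace(df["FIRSTOCCURRENCE"], r"\d{3}( [AP]M)$", "$1"),
        "dd-MMM-yy hh.mm.ss.SSS a"                         # Java SimpleDateFormat pattern
    ).cast("timestamp")                                    # seconds since epoch -> timestamp
)

Note that unix_timestamp only keeps whole seconds, so the fractional part the UDF parsed is dropped; for a groupBy/count that should not matter.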