Why is my Spark application suddenly much slower than before?

Time: 2016-03-24 01:02:51

Tags: apache-spark pyspark spark-streaming datastax datastax-enterprise

This is part of my DataStax Spark app code. The part that suddenly became slow starts at def statistics(rdd, time_span). By slow I mean the Spark application now takes more than 480 seconds to run, whereas before it finished within 25 seconds. I am only processing about 2.4 million rows, so the data volume is not large. I have already tried DataStax's Spark SQL, but that did not help or noticeably improve the speed. Any idea why it is so slow, or how I can improve my code? I use Spark to connect to Cassandra, and I have a Spark cluster with 3 nodes and 321GB.
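For context, the snippet below assumes a PySpark-plus-Cassandra setup roughly like the following minimal sketch; the contact point, app name, and the sqlContext variable the snippet relies on are reconstructed assumptions, not taken from the original post:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Placeholder contact point; point this at one of the Cassandra nodes.
conf = SparkConf() \
    .setAppName("statistics-app") \
    .set("spark.cassandra.connection.host", "127.0.0.1")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)  # the snippet uses this to build the result DataFrame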

def statistics(rdd, time_span):
    # Each window (in hours) is bounded below by the end of the next-shorter
    # window; anything beyond 72 hours falls into the last bucket.
    window_bounds = {1: 0, 3: 1, 6: 3, 12: 6, 24: 12, 72: 24}
    end = window_bounds.get(time_span, 72)
    # Keep articles whose created_at (shifted to UTC+8) falls inside the current
    # window, then attach the channel's author name to each article.
    article_channels = articlestat.filter(lambda x:x[1]['created_at']+timedelta(hours=8)>=datetime.now()-timedelta(hours=time_span) and x[1]['created_at']+timedelta(hours=8)<=datetime.now()-timedelta(hours=end)).join(channels).map(lambda x:(x[1][0]['id'],{'id':x[1][0]['id'],'thumbnail':x[1][0]['thumbnail'],'title':x[1][0]['title'],'url':x[1][0]['url'],'created_at':x[1][0]['created_at'],'source':x[1][0]['source'],'category':x[1][0]['category'],'author':x[1][1]['name']}))
    # For each article, collect its metric snapshots from the window, pick the
    # two most recent ones, and derive a growth "speed" from the comment delta
    # (per-minute rate scaled by 5*288 = 1440 minutes, i.e. one day).
    speed_rdd = axes.filter(lambda x:x.at+timedelta(hours=8)>=datetime.now()-timedelta(hours=time_span)).map(lambda x:(x.article,[[x.at,x.comments,x.likes,x.reads,x.shares]])) \
                .reduceByKey(lambda x,y:x+y).join(articles).map(lambda x:(x[0],x[1][0],x[1][1][4]-timedelta(hours=8))) \
                .map(lambda x:(x[0],sorted(x[1],key=lambda y:y[0],reverse = True)[0],sorted(x[1],key=lambda y:y[0],reverse = True)[1]) if len(x[1])>=2 else (x[0],sorted(x[1],key=lambda y:y[0],reverse = True)[0],[x[2],0,0,0,0])) \
                .filter(lambda x:(x[1][0]-x[2][0]).seconds>0) \
                .map(lambda x:(x[0],{'id':x[0],'comments':x[1][1],'likes':x[1][2],'reads':x[1][3],'shares':x[1][4],'speed':5*288*((x[1][1]-x[2][1])/((x[1][0]-x[2][0]).seconds/60.0))})) \
                .filter(lambda x:x[1]['comments']>0)
    # Join article metadata with the speed metrics into one flat record.
    statistics = article_channels.join(speed_rdd)  \
                .map(lambda x:{'id':x[1][0]['id'],'thumbnail':x[1][0]['thumbnail'],'title':x[1][0]['title'],'url':x[1][0]['url'],'created_at':x[1][0]['created_at'],'source':x[1][0]['source'],'category':x[1][0]['category'],'author':x[1][0]['author'],'comments':x[1][1]['comments'],'likes':x[1][1]['likes'],'reads':x[1][1]['reads'],'shares':x[1][1]['shares'],'speed':x[1][1]['speed']})
    # Shape each record as a Row for the Cassandra table, shifting created_at
    # back to UTC+8.
    result = statistics.map(lambda x:Row(timespan=str(time_span),source=source,id=x['id'],title=x['title'],thumbnail=x['thumbnail'],url=x['url'],created_at=x['created_at']+timedelta(hours=8),genre='',reads=0,likes=x['likes'],comments=x['comments'],shares=x['shares'],speed=x['speed'],category=x['category'],author=x['author']))
    # count() forces the whole lineage to be evaluated before anything is written.
    if result.count()>0:
        # Replace the previous statistics for this (source, timespan) bucket.
        session_statis.execute('DELETE FROM tablename WHERE source = %s and timespan= %s', (source,str(time_span)))
        resultschema1 = sqlContext.createDataFrame(result)
        resultschema1.write.format("org.apache.spark.sql.cassandra").options(table="tablename", keyspace = "keyspace").save(mode ="append")
    end = datetime.now(tz)
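
One thing worth flagging in the tail of the snippet: result.count() evaluates the whole lineage once, and the subsequent save() evaluates it again from scratch. A minimal sketch of caching so the two actions share one computation, reusing the snippet's own names; whether this accounts for the 480-second regression is only an assumption:

# Cache so that count() and save() share one evaluation of the lineage.
result.cache()
if result.count() > 0:
    resultschema1 = sqlContext.createDataFrame(result)
    resultschema1.write.format("org.apache.spark.sql.cassandra") \
        .options(table="tablename", keyspace="keyspace") \
        .save(mode="append")
result.unpersist()  # free the cached partitions once the write is done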

0 Answers:

No answers