Been playing with Spark for about 5 months now, so definitely still pretty new to it.
I have a job and I need some help identifying the bottleneck and how to improve it.
6 nodes, 30 GB RAM, 8 vCPUs each. We also have Hive, Impala and a few other things running on the cluster, so there is plenty of overhead.
Spark version is 2.2.1.
Basically, the job takes a flat .CSV, joins it to a stored .AVRO file, does some calculations, finds the "matches", splits them out and writes them to SQL Server. Data skew is a very, very real problem, and to deal with it I repartition so the executors stop dying. The job works well, but I am eager to make it better and faster, and to be able to scale it further.
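To show what I mean by skew, a check along these lines (purely illustrative, not part of the job) is what I use to see which zip codes dominate the join key:
import pyspark.sql.functions as f
# Illustrative only: row count per join key, largest first, to spot the skewed zip codes.
keycounts = tableA.groupBy('zipcodeA').count().orderBy(f.desc('count'))
keycounts.show(20, truncate=False)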
Table A's data is organized as below, and Table B has the same structure. Table A is 26 million rows and counting; Table B can be anywhere from 50,000 rows to 1 million. (A tiny hand-built sample is sketched after the two tables.)
TableA - AVRO file, roughly 3 GB in size, all strings.
+---------+--------+------+
|nameA |zipcodeA| IDA |
+---------+--------+------+
|ABC Paint|10001 | |
|ABC Cards|10001 | |
|Mikes Tow|10001 |140000|
|Bobs Stuf|07436 | |
|NOT KNOWN|19428 | |
|NOT KNOWN|08852 |160000|
|Sub SHOP |90001 | |
|BURGER RA|90001 |140000|
+---------+--------+------+
TableB - flat CSV, size varies but never more than about 1 million rows, all strings.
+---------+--------+------+
|nameB |zipcodeB| IDB |
+---------+--------+------+
|ABC Paint|10001 |100000|
|ABC Card |10001 |120000|
|Mikes Tow|10001 |140000|
|BOS STUFF|07436 |160000|
|XYZ CORP |19428 |100000|
|92122211 |08852 |160000|
|Sub SHOP |90001 |120000|
|BURGER RA|90001 |140000|
+---------+--------+------+
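For anyone who wants to poke at the matching logic without my data, a miniature version of the two tables with the same columns looks like this (illustrative only; assumes the spark session created in the job below):
# Illustrative only: tiny TableA / TableB stand-ins, all string columns, blank IDA where unknown.
sampleA = spark.createDataFrame(
    [('ABC Paint', '10001', ''), ('Mikes Tow', '10001', '140000')],
    ['nameA', 'zipcodeA', 'IDA'])
sampleB = spark.createDataFrame(
    [('ABC Paint', '10001', '100000'), ('Mikes Tow', '10001', '140000')],
    ['nameB', 'zipcodeB', 'IDB'])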
appName = "FINDTHEMATCHES"
conf = SparkConf().setAppName(appName)
sc = SparkContext(conf=conf)
spark = (sql.SparkSession.builder \
.appName(appName)
.getOrCreate())
tableB = spark.read.option("header", "true") \
.option("delimiter", "|") \
.option("inferSchema", "false") \
.csv("s3a://bucket/file.txt")
tableA= spark.read.format("com.databricks.spark.avro") \
.option("inferSchema", "false") \
.load("s3a://bucket2/part-m-00000.avro")
tableAFinal = tableA.select('nameA', 'zipcodeA', 'IDA')
ta = tableAFinal.alias('cfa')
tb = tableB.alias('cfb')
# Here we do a one-to-many join so that all the candidate records are lined up by zip code.
zipcodematches = ta.join(tb, ta.zipcodeA == tb.zipcodeB, how='inner')
zipmatchpart = zipcodematches.repartition(500, 'IDB')
jarow = udf(jaro, FloatType())  # jaro is a massive function which I left out since I don't want a 10 page SO question, but will put it in if requested (an illustrative stand-in is sketched after the listing).
# The function calculates the Jaro distance between the name strings from the two dataframes and gives us our new column.
matchcompanies = zipmatchpart.withColumn('MATCHBUS', jarow('nameB', 'nameA'))
matches1 = matchcompanies.withColumn('MATCHES', f.when(matchcompanies.MATCHBUS >= 0.91, 1).otherwise(0))
matches = matches1.where(col('MATCHES') == 1)
IDsA = [x.IDA for x in matches.select('IDA').distinct().collect()]
matchesunique = matches.dropDuplicates(['IDA'])
matchesunique.write.jdbc(
url=sqlurl,
table='dbo.existingrecords',
mode='overwrite',
properties=sqlproperties)
nonmatches = matches1.where(col('MATCHES') == 0)  # non-matches come from the full scored set, not the already-filtered matches
nonmatchesfiltered = nonmatches.filter(~nonmatches.IDB.isin(IDsA))
nonmatchesunique = nonmatchesfiltered.dropDuplicates(['IDB'])
nonmatchesunique.write.jdbc(
url=sqlurl,
table='dbo.newrecordstoinsert',
mode='overwrite',
properties=sqlproperties)
spark.stop()
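For completeness, here is an illustrative stand-in for the omitted jaro function, just to show its shape; it leans on the jellyfish package rather than my actual implementation:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
import jellyfish  # stand-in dependency only; newer jellyfish releases call this jaro_similarity instead of jaro_distance

def jaro(name_b, name_a):
    # Return 0.0 for nulls so the UDF never throws inside an executor.
    if name_b is None or name_a is None:
        return 0.0
    return float(jellyfish.jaro_distance(name_b, name_a))

jarow = udf(jaro, FloatType())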
What I am looking for is: could I be doing things differently and more efficiently? Does anyone see anything in the Spark job that is a flat-out "no-no"? I keep wondering whether my way of separating out the duplicate and non-duplicate records is too expensive, and whether there is a way to clean it up, or anything else. A colleague suggested pulling the data from Hive instead of using flat files, but that actually made performance noticeably worse, which was surprising since it is stored as AVRO in S3.