Question

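I am trying to find the most efficient way to use withColumn where a column's value serves as the key into a Python dictionary. I have tried joins, expressions, and literal mappings. Literal mappings seem to be the fastest.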

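I would like to be able to use a Pandas UDF with a dictionary lookup, but I am running into some peculiar performance problems. Take the following code as an example: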

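import time

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Assumes an existing SparkSession `spark` and SparkContext `sc`
# (for example, in a notebook or the pyspark shell).
rows = [(i % 3, i % 100) for i in range(5000000)]
test_df = spark.createDataFrame(rows,['group','col1'])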

lookup_table = {}
for i in range(100):
  lookup_table[i] = i + 1.0

lookup_table = sc.broadcast(lookup_table)

def testPandasUDF(df):
  @pandas_udf('double')
  def formulaFunc(x):
    return x + 1

  return df.withColumn('pandas_udf',formulaFunc('col1'))

start = time.time()
r = testPandasUDF(test_df)
r.groupBy(['group']).agg({'pandas_udf': 'sum'}).show()
print('Time taken with no lookup', time.time()-start)

def testPandasUDFWithLookup(df):
  @pandas_udf('double')
  def formulaFunc(x):
    results = [lookup_table.value[x.iat[i]] for i in range(x.size)]
    return pd.Series(results)  

  return df.withColumn('pandas_udf_lookup',formulaFunc('col1'))

start = time.time()
r = testPandasUDFWithLookup(test_df)
r.groupBy(['group']).agg({'pandas_udf_lookup': 'sum'}).show()
print('Time taken with lookup', time.time()-start)

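The issue is that with a small dataset (say 100k rows), the dictionary lookup has a negligible effect on run time. However, once the dataset reaches 5m+ rows, performance drops sharply when the dictionary lookup is used.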

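Is there a reason for this? I have already broadcast the dictionary to the nodes. I assumed that, since I am already in Python, querying a Python dictionary would not incur any additional context-switching overhead?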

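The slowdown is very pronounced: with 50 million rows, the lookup version is 20 times slower. (I would point out that the same lookup code in a standard Python function is very fast, so nothing is wrong with the logic.)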
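To help isolate where the time goes, the two lookup styles can be compared on a plain pandas Series outside Spark. This is only a sketch under stated assumptions: the names `lookup`, `lookup_loop`, `lookup_map`, and the batch size are hypothetical, and it has not been confirmed that per-element `iat` access is the cause of the slowdown described above. It simply checks that a vectorized `Series.map` returns the same values as the list comprehension and prints the timing of each.

```python
import time

import pandas as pd

# Same shape as the lookup table in the question: keys 0..99 map to key + 1.0.
lookup = {i: i + 1.0 for i in range(100)}


def lookup_loop(x: pd.Series) -> pd.Series:
    """Per-element lookup, mirroring the list comprehension in the question."""
    return pd.Series([lookup[x.iat[i]] for i in range(x.size)])


def lookup_map(x: pd.Series) -> pd.Series:
    """Vectorized lookup: Series.map accepts a dict and maps all values in one call."""
    return x.map(lookup)


# Hypothetical batch size, chosen only to keep the comparison quick.
batch = pd.Series([i % 100 for i in range(200_000)])

start = time.perf_counter()
loop_result = lookup_loop(batch)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
map_result = lookup_map(batch)
map_seconds = time.perf_counter() - start

# Both approaches should produce identical values.
print("same values:", loop_result.equals(map_result))
print(f"iat loop: {loop_seconds:.3f}s, Series.map: {map_seconds:.3f}s")
```

If the loop turns out to dominate, one option to try inside the UDF (again an assumption, not a verified fix) would be to replace the list comprehension with `return x.map(lookup_table.value)`.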

PySpark Pandas UDF is extremely slow when using a dictionary lookup

0 answers: