I am using Spark 1.3.
# Read from text file, parse it and then do some basic filtering to get data1
data1.registerTempTable('data1')
# Read from text file, parse it and then do some basic filtering to get data2
data2.registerTempTable('data2')
# Perform join
data_joined = data1.join(data2, data1.id == data2.id)
My data is heavily skewed: data2 (a few KB) << data1 (tens of GB), and performance is quite poor. I have been reading about broadcast joins, but I am not sure how to do the same thing with the Python API.
Answer 0 (score: 28)
Spark 1.3 doesn't support broadcast joins using DataFrames. In Spark >= 1.5.0 you can use the broadcast function to apply a broadcast join:
from pyspark.sql.functions import broadcast
data1.join(broadcast(data2), data1.id == data2.id)
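Aside from the explicit hint, Spark SQL can also broadcast a small table automatically when its estimated size is below `spark.sql.autoBroadcastJoinThreshold` (in bytes). A configuration sketch only; the 10 MB value is purely illustrative, and `sqlContext` is assumed to be an existing SQLContext:

```python
# Raise the auto-broadcast threshold to ~10 MB so that small tables
# like data2 get broadcast without an explicit hint (value illustrative).
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))
```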
For older versions the only option is to convert to RDDs and apply the same logic as in other languages. Roughly like this:
from pyspark.sql import Row
from pyspark.sql.types import StructType
# Create a dictionary where keys are join keys
# and values are lists of rows
data2_bd = sc.broadcast(
    data2.map(lambda r: (r.id, r)).groupByKey().collectAsMap())
# Define a new row with fields from both DFs
output_row = Row(*data1.columns + data2.columns)
# And an output schema
output_schema = StructType(data1.schema.fields + data2.schema.fields)
# Given row x, extract a list of corresponding rows from broadcast
# and output a list of merged rows
def gen_rows(x):
    return [output_row(*x + y) for y in data2_bd.value.get(x.id, [])]
# flatMap and create a new data frame
joined = data1.rdd.flatMap(lambda row: gen_rows(row)).toDF(output_schema)
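The workaround above is simply a map-side hash join: the small side becomes an in-memory dictionary keyed by the join key, and the large side is streamed through it, emitting one merged row per match. A minimal pure-Python sketch of that idea (plain lists stand in for the RDDs; the tuples and values here are made up for illustration):

```python
# Map-side hash join sketch: small side becomes an in-memory dict,
# the large side is streamed and joined by lookup.
data1 = [(1, "a"), (2, "b"), (3, "c")]  # large side: (id, v1)
data2 = [(1, "x"), (1, "y"), (3, "z")]  # small side: (id, v2)

# "Broadcast": group the small side by key into a dict of lists,
# mirroring groupByKey().collectAsMap().
small_by_key = {}
for k, v in data2:
    small_by_key.setdefault(k, []).append(v)

# flatMap equivalent: one output row per matching pair,
# nothing emitted when there is no match (inner-join semantics).
joined = [
    (k, v1, v2)
    for k, v1 in data1
    for v2 in small_by_key.get(k, [])
]

print(joined)  # [(1, 'a', 'x'), (1, 'a', 'y'), (3, 'c', 'z')]
```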
Answer 1 (score: -1)
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
spark = SparkSession.builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
df2 = spark.read.csv("D:\\trans_mar.txt", sep="^")
df1 = spark.read.csv("D:\\trans_feb.txt", sep="^")
print(df1.join(broadcast(df2), df2._c77 == df1._c77).take(10))