Question

I have the following PySpark code.

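# Imports needed by the snippet below
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Read the contents of a text file into an RDD
data_set_rdd = spark_context.textFile(full_file_path)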

# Read the header line
header = data_set_rdd.first()

# Construct the schema
fields = [StructField(field_name, StringType(), True) for field_name in header.split(delimiter)]
schema = StructType(fields)

# Filter empty lines
non_empty_lines = data_set_rdd.filter(lambda line: len(line) > 0)

# Filter header record from the RDD
contents_rdd = non_empty_lines.filter(lambda line: line != header)

# Split the RDD into columns based on delimiter
delimited_contents_rdd = contents_rdd.map(lambda k: k.split(delimiter))

# Convert RDD into DataFrame  
contents_df = sqlContext.createDataFrame(delimited_contents_rdd, schema)

# Create a function
def check_value(val):
    if val > 220:
        return 220
    else:
        return val

# Register the function as a pyspark udf
check_val = udf(check_value, IntegerType())

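# Add a column to the dataframe which uses the check_val function to perform some computation.
# Every field is read as a string (see the schema), so cast voltage to int before comparing.
final_df = contents_df.withColumn("transformed_voltage", check_val(contents_df["voltage"].cast("int")))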

final_df.show(10)

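Now, the step that calls the UDF to add the column (withColumn) takes a very long time to complete: about 10 minutes. I am running it on my laptop, which has 8 GB of RAM and 4 cores.

My input file contains only 10 records.

My text file has the following schema, with 2 columns: [timestamp, voltage]

If I replace the UDF with a PySpark built-in function such as when, it finishes in a few milliseconds.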
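For reference, here is a minimal sketch of what the built-in variant might look like, next to a pure-Python version of the same capping logic. The names cap_voltage, with_capped_voltage and limit are made up for illustration and are not part of the original code. The sketch assumes the voltage column holds integer values stored as strings, as in the schema above.

```python
def cap_voltage(val, limit=220):
    """Pure-Python capping logic, i.e. what a UDF would wrap.

    Assumes val is None or a string/number that int() can parse.
    """
    if val is None:
        return None
    return min(int(val), limit)


def with_capped_voltage(df, limit=220):
    """Add transformed_voltage using built-in column expressions (no Python UDF)."""
    # Imported here so cap_voltage can be used without a Spark installation.
    from pyspark.sql import functions as F

    # The schema reads every field as a string, so cast before comparing.
    voltage = F.col("voltage").cast("int")
    return df.withColumn(
        "transformed_voltage",
        F.when(voltage > limit, F.lit(limit)).otherwise(voltage),
    )
```

With the DataFrame from the code above, with_capped_voltage(contents_df).show(10) would take the place of the withColumn call that uses the UDF.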

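I expected UDFs to be slow, but can they really be this slow?

Am I doing something wrong here?

Any help is appreciated, since I will eventually be writing custom UDFs for my project.

How can I improve the performance of PySpark UDFs?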

0 Answers: