Question

I have the following small piece of code.

# do all the required imports
import pyspark
import pandas as pd
from pyspark.sql.functions import udf
from pyspark.sql import SparkSession

#create a session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

#fashion_df is described below
fashion_df = pd.read_csv('fashion_df.csv')

#create a UDF
def check_merchant_cat(text):
    if not isinstance(text, str):
        category = "N/A"
        return category

    category = fashion_df[fashion_df['merchant_category']==text]['merchant_category']

    return category

merchant_category_mapping_func = udf(check_merchant_cat)

df = spark.read.csv('datafeed.csv', header=True, inferSchema=True)

processed_df = df.withColumn("merchant_category_mapped", merchant_category_mapping_func(df['merchant_category']))

processed_df.select('merchant_category_mapped', 'merchant_category').show(10)

Let me describe the problem I am trying to solve.

I have a fashion_df, which is basically a set of rows (about 1,000) with the following header:

merchant_category,category,sub_category
Dresses & Skirts,Dress,Skirts

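The code above also refers to datafeed.csv, which has about 1 million rows. Each row has many columns, but only a few of them are of interest.

Basically, I want to iterate over every row of datafeed.csv and look at that row's merchant_category column. I then want to search for this merchant category value in the "merchant_category" column of the fashion_df pandas DataFrame. Once a matching row is found, I take the value in the category column of that matching row in fashion_df and return it.

The returned category value is then added as a new column to the original data feed loaded in PySpark.

Is this the right way to do it?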

Answer 1

Step zero: import the functions:

from pyspark.sql.functions import *

Step one: create a Spark DataFrame:

#Instead of: fashion_df = pd.read_csv('fashion_df.csv')
fashion_df = spark.read.csv('fashion_df.csv', header=True, inferSchema=True).withColumnRenamed("merchant_category", "mc")

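The column is renamed only to make the later steps simpler.

Step two: join with this DataFrame. Important: use a left join, so that rows without a match are kept and their nulls can later be mapped to the "N/A" category: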

df_with_fashion = df.join(fashion_df, df.merchant_category == fashion_df.mc, 'left')
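To make the reason for the left join concrete, here is a minimal plain-Python sketch (not Spark code) of the difference between a left join and an inner join. The sample rows, the "Garden Tools" value, and the helper names left_join and inner_join are made up for illustration.

```python
# Plain-Python illustration of join semantics; all data here is hypothetical.
feed_rows = [
    {"id": 1, "merchant_category": "Dresses & Skirts"},
    {"id": 2, "merchant_category": "Garden Tools"},  # no entry in the lookup
]

# Lookup table keyed by merchant category, mirroring fashion_df's columns.
lookup = {"Dresses & Skirts": {"category": "Dress", "sub_category": "Skirts"}}


def left_join(rows, table):
    """Keep every feed row; unmatched rows get None in the lookup columns."""
    joined = []
    for row in rows:
        match = table.get(row["merchant_category"])
        extra = match if match is not None else {"category": None, "sub_category": None}
        joined.append({**row, **extra})
    return joined


def inner_join(rows, table):
    """Keep only the feed rows that have a match in the lookup table."""
    return [
        {**row, **table[row["merchant_category"]]}
        for row in rows
        if row["merchant_category"] in table
    ]


left = left_join(feed_rows, lookup)
inner = inner_join(feed_rows, lookup)
```

With an inner join the unmatched row would disappear from the feed; with a left join it survives with an empty category, which is the value the next step replaces with "N/A".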

Step three: create a new column and map nulls to "N/A".

# "category" comes from fashion_df and is null for rows that found no match
processed_df = df_with_fashion.withColumn("merchant_category_mapped", coalesce(col("category"), lit("N/A")))
