Pyspark: Passing dynamic columns to a UDF

Date: 2017-11-11 11:06:04

Tags: python-3.x hadoop apache-spark dataframe pyspark

I am trying to send a list of columns to a UDF one by one using a for loop, but I get an error saying the dataframe cannot find col_name. Currently the list list_col has two columns, but it can change, so I want to write code that works for any list of columns. In this code I concatenate the columns of a row at a time; the row values are in struct format, i.e. lists within a list. For every null value I have to insert a space.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    list_col = ['pcxreport', 'crosslinediscount']

    def struct_generater12(row):
        list3 = []
        main_str = ''
        if row is None:
            list3.append(' ')
        else:
            for i in row:
                temp = ''
                if i is None:
                    temp += ' '
                else:
                    for j in i:
                        if j is None:
                            temp += ' '
                        else:
                            temp += str(j)
                list3.append(temp)
        for k in list3:
            main_str += k
        return main_str


    A = udf(struct_generater12,returnType=StringType())
    # z = addlinterestdetail_FDF1.withColumn("Concated_pcxreport",A(addlinterestdetail_FDF1.pcxreport))
    for i in range(0,len(list_col)-1):
        struct_col='Concate_'
        struct_col+=list_col[i]
        col_name=list_col[i]
        z = addlinterestdetail_FDF1.withColumn(struct_col,A(addlinterestdetail_FDF1.col_name))
        struct_col=''

    z.show()

1 Answer:

Answer 0: (score: 1)

addlinterestdetail_FDF1.col_name means the column is literally named "col_name"; you are not accessing the string contained in the variable col_name.

When calling the UDF on a column, you can either:

  • use its string name directly: A(col_name)
  • or use the pyspark sql function col:

    import pyspark.sql.functions as psf
    z = addlinterestdetail_FDF1.withColumn(struct_col, A(psf.col(col_name)))
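
For example, a corrected version of the loop from the question (a sketch, assuming the same addlinterestdetail_FDF1 dataframe, the UDF A, and list_col as above; note that the posted loop stops at len(list_col)-1 and rebuilds z from the original dataframe on each pass, so only one new column survives):

    import pyspark.sql.functions as psf

    # Apply the UDF to every column named in list_col, chaining withColumn calls
    z = addlinterestdetail_FDF1
    for col_name in list_col:
        z = z.withColumn('Concate_' + col_name, A(psf.col(col_name)))
    z.show()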

You should consider using pyspark sql functions for the concatenation instead of writing a UDF. First, let's create a sample dataframe with nested structures:

    import json
    j = {'pcxreport':{'a': 'a', 'b': 'b'}, 'crosslinediscount':{'c': 'c', 'd': None, 'e': 'e'}}
    jsonRDD = sc.parallelize([json.dumps(j)])
    df = spark.read.json(jsonRDD)
    df.printSchema()
    df.show()

    root
     |-- crosslinediscount: struct (nullable = true)
     |    |-- c: string (nullable = true)
     |    |-- d: string (nullable = true)
     |    |-- e: string (nullable = true)
     |-- pcxreport: struct (nullable = true)
     |    |-- a: string (nullable = true)
     |    |-- b: string (nullable = true)

    +-----------------+---------+
    |crosslinediscount|pcxreport|
    +-----------------+---------+
    |       [c,null,e]|    [a,b]|
    +-----------------+---------+

We'll write a dictionary with the nested column names:

    list_col = ['pcxreport', 'crosslinediscount']
    list_subcols = dict()
    for c in list_col:
        list_subcols[c] = df.select(c + '.*').columns

Now we can "flatten" the StructType columns, replace None values with ' ', and concatenate them:

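A minimal sketch of that last step, assuming the df, list_col, and list_subcols names defined above (concat_ws joins the flattened sub-columns and coalesce substitutes a space for null values, mirroring the original UDF):

    result = df.select([
        psf.concat_ws('', *[psf.coalesce(psf.col(c + '.' + sub), psf.lit(' '))
                            for sub in list_subcols[c]]).alias('Concate_' + c)
        for c in list_col
    ])
    result.show()

For the sample row this should produce 'ab' for pcxreport and 'c e' for crosslinediscount.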