How to create a PySpark UDF that iterates over a column of arrays

Time: 2018-02-14 16:13:58

Tags: python pyspark apache-spark-sql user-defined-functions

Beginner PySpark question here. How do I create a UDF that iterates over an array of strings in a column?

I have a DataFrame of about 6M rows in which I have extracted elements into separate columns.


I want to create a UDF that computes an 'agreement' score for each extracted list/array. The sample DataFrame is built as follows:

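from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Reuse the active session, or start one when run as a standalone script.
spark = SparkSession.builder.getOrCreate()

# One sample row: the full classification strings plus the pieces extracted from them.
d = [{'1-ID': 'Alice', '2-Full_Classification': ["H02J 1/08 20060101AFI20151217BHEP", "B63H 21/17 20060101ALI20151217BHEP", 'B65D 39/12 20060101A I20051008RMEP'], '3-Section': ['H', 'B', 'B'],  '4-Class': ['02', '63', '65'], '5-SubClass': ['J', 'H', 'D']}]

# Field names must match the dict keys above so the schema can be applied to the rows.
schema = StructType([
    StructField("1-ID", StringType(), True),
    StructField("2-Full_Classification", ArrayType(StringType()), True),
    StructField("3-Section", ArrayType(StringType()), True),
    StructField("4-Class", ArrayType(StringType()), True),
    StructField("5-SubClass", ArrayType(StringType()), True)
    ])
df = spark.createDataFrame(d, schema)
df.printSchema()
df.show(truncate=100)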

Output:

root
 |-- 1-ID: string (nullable = true)
 |-- 2-Full_Classification: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- 3-Section: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- 4-Class: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- 5-SubClass: array (nullable = true)
 |    |-- element: string (containsNull = true)
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
| 1-ID|                                                                               2-Full_Classification|3-Section|     4-Class|5-SubClass|
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
|Alice|[H02J 1/08 20060101AFI20151217BHEP, B63H 21/17 20060101ALI20151217BHEP, B65D 39/12 20060101A I200...|[H, B, B]|[02, 63, 65]| [J, H, D]|
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+

The agreement score should behave as follows:

List1 = ['A', 'A', 'A', 'A']  # agreement score would be 100%
List2 = ['12', '12', '12', '13', '13', '13']  # agreement score would be 50%
List3 = ['C', 'D', 'E']  # agreement score would be 0%
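To make the three examples concrete, here is a minimal plain-Python sketch of one rule that reproduces them. The rule itself is an assumption, since the question does not define the score: it takes the share of the most frequent value, and treats a list in which no value repeats as 0%. The name `agreement_score` is hypothetical, and the `None` guard assumes that a null array reaches Python as `None`.

```python
from collections import Counter
from typing import Optional, Sequence


def agreement_score(values: Optional[Sequence[str]]) -> Optional[float]:
    """Hypothetical scoring rule inferred from the three example lists."""
    # A null array arrives as None; return None instead of iterating over it.
    if values is None:
        return None
    if len(values) == 0:
        return 0.0
    # Number of occurrences of the most frequent value in the list.
    top_count = Counter(values).most_common(1)[0][1]
    # Assumption: if no value repeats, there is no agreement at all.
    if top_count == 1:
        return 0.0
    return 100.0 * top_count / len(values)


print(agreement_score(['A', 'A', 'A', 'A']))                    # 100.0
print(agreement_score(['12', '12', '12', '13', '13', '13']))    # 50.0
print(agreement_score(['C', 'D', 'E']))                         # 0.0
```

Because the function uses only the Python standard library, it could in principle be wrapped with `pyspark.sql.functions.udf` and a `DoubleType()` return type and applied to each array column. Other definitions of agreement (for example, counting a single-element list as 100%) would need a different rule.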

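However, when I run my UDF, I get the following error messages (the first ones on my full 6M-row DataFrame, the last one on the sample DataFrame above):

AttributeError: 'NoneType' object has no attribute '_jvm'

OR

TypeError: 'NoneType' object is not iterable

OR

TypeError: object of type 'type' has no len()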

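I'm confused, because each column contains an array of strings, which should be iterable. Any suggestions would be greatly appreciated.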

0 answers:

No answers yet