Beginner PySpark question here. How do I create a udf that iterates over an array of strings in a column?
I have a DataFrame of about 6M rows, from which I extracted elements into separate columns. I want to create a udf that computes an 'agreement' score for each extracted list/array. Here is a sample:
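from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()  # already defined in the pyspark shell

d = [{'1-ID': 'Alice', '2-Full_Classification': ["H02J 1/08 20060101AFI20151217BHEP", "B63H 21/17 20060101ALI20151217BHEP", 'B65D 39/12 20060101A I20051008RMEP'], '3-Section': ['H', 'B', 'B'], '4-Class': ['02', '63', '65'], '5-SubClass': ['J', 'H', 'D']}]
# Field names must match the dict keys, otherwise the schema cannot be applied to the rows.
schema = StructType([
    StructField("1-ID", StringType(), True),
    StructField("2-Full_Classification", ArrayType(StringType()), True),
    StructField("3-Section", ArrayType(StringType()), True),
    StructField("4-Class", ArrayType(StringType()), True),
    StructField("5-SubClass", ArrayType(StringType()), True)
])
df = spark.createDataFrame(d, schema)  # pass the schema; the original defined it but never used it
df.printSchema()
df.show(truncate=100)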
Output:
root
|-- 1-ID: string (nullable = true)
|-- 2-Full_Classification: array (nullable = true)
| |-- element: string (containsNull = true)
|-- 3-Section: array (nullable = true)
| |-- element: string (containsNull = true)
|-- 4-Class: array (nullable = true)
| |-- element: string (containsNull = true)
|-- 5-SubClass: array (nullable = true)
| |-- element: string (containsNull = true)
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
| 1-ID| 2-Full_Classification|3-Section| 4-Class|5-SubClass|
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
|Alice|[H02J 1/08 20060101AFI20151217BHEP, B63H 21/17 20060101ALI20151217BHEP, B65D 39/12 20060101A I200...|[H, B, B]|[02, 63, 65]| [J, H, D]|
+-----+----------------------------------------------------------------------------------------------------+---------+------------+----------+
The agreement score should work as follows:
List1 = ['A', 'A', 'A', 'A'] #agreement score would be 100%
List2 = ['12', '12', '12', '13', '13', '13'] #agreement score would be 50%
List3 = ['C', 'D', 'E'] #agreement score would be 0%
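To make the scoring concrete, here is a minimal sketch in plain Python of one rule that reproduces the three examples above. The rule itself is an assumption: the score is the share of the most frequent element, and it drops to 0 when no element repeats. The function name `agreement_score` and the handling of null or empty arrays (returning `None`) are also illustrative choices, not part of the original code.

```python
from collections import Counter


def agreement_score(values):
    """Return an agreement percentage for a list of strings.

    Assumed rule: share of the most common element, or 0.0 when the
    list has several elements and none of them occurs more than once.
    """
    # A null array arrives as None inside a Python udf; iterating over it
    # would raise "'NoneType' object is not iterable", so guard first.
    if not values:
        return None
    top_count = Counter(values).most_common(1)[0][1]
    # No repeated element among several values: treat as no agreement.
    if top_count == 1 and len(values) > 1:
        return 0.0
    return 100.0 * top_count / len(values)


print(agreement_score(['A', 'A', 'A', 'A']))
print(agreement_score(['12', '12', '12', '13', '13', '13']))
print(agreement_score(['C', 'D', 'E']))
```

In PySpark, a function like this could be wrapped with `udf(agreement_score, DoubleType())` (from `pyspark.sql.functions` and `pyspark.sql.types`) and applied to each array column with `withColumn`. A single-element list scores 100 under this rule, which may or may not be what is wanted.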
However, when I run my udf, I get the following error messages (the first on my full 6M-row DataFrame, the last on the sample DataFrame above):
AttributeError: 'NoneType' object has no attribute '_jvm'
OR
TypeError: 'NoneType' object is not iterable
OR
TypeError: object of type 'type' has no len()
I'm confused, because each column contains an array of strings, which should be iterable. Any advice is greatly appreciated.