Question

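Hello, and thank you in advance.

My program is written in Java, and I cannot move to Scala.

I am currently working with a Spark DataFrame extracted from a JSON file using the following line: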

DataFrame df = sqlContext.read().json("filePath.json");

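The SQLContext and SparkContext are correctly initialized and run perfectly.

The problem is that the JSON I am reading has nested structs, and I want to clean/validate the inner data without changing the schema.

One of the DataFrame's columns in particular has the "GenericRowWithSchema" type.

Let's say I want to clean only that column, named "data".

The solution that came to mind was to define a user-defined function (UDF) named "cleanDataField" and then run it over the column "data". Here is the code: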

UDF1<GenericRowWithSchema, GenericRowWithSchema> cleanDataField = new UDF1<GenericRowWithSchema, GenericRowWithSchema>() {
    // Cleans the nested row in place, then returns it.
    public GenericRowWithSchema call(GenericRowWithSchema grws) {
        cleanGenericRowWithSchema(grws);
        return grws;
    }
};

Then I would register the function in the SQLContext:

sqlContext.udf().register("cleanDataField", cleanDataField, DataTypes.StringType);

Then I would call

df.selectExpr("cleanDataField(data)").show(10, false);

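in order to see the first 10 rows with the cleaned data.

Finally, my question is: can I return complex data (such as a custom class object)? If it is possible, how should I do it? I guess I have to change the third parameter of the UDF registration, because I am not returning a string, but what should I replace it with?

Thank you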

Answer 1

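Suppose you want to build the following data type:

struct<companyid:string,loyaltynum:int,totalprice:int,itemcount:int>

To do this, create a StructType with one StructField per field, for example with DataTypes.createStructField and DataTypes.createStructType.

You can then use that data type as the return type when registering your UDF.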
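As a rough sketch of how this could fit together in Java (not a definitive implementation): the class name CleanDataFieldUdf, the helper names dataType, clean and nonNegative, and the cleaning rules themselves (trimming companyid, clamping negative numbers to zero) are all made up for illustration. The sketch also assumes the "data" struct holds its fields in the same order as the type above; df.printSchema() shows the real order and types, and the code should be adjusted to match.

```java
import java.util.Arrays;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper class; none of these names come from the question.
public class CleanDataFieldUdf {

    // Builds struct<companyid:string,loyaltynum:int,totalprice:int,itemcount:int>.
    public static StructType dataType() {
        return DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("companyid", DataTypes.StringType, true),
            DataTypes.createStructField("loyaltynum", DataTypes.IntegerType, true),
            DataTypes.createStructField("totalprice", DataTypes.IntegerType, true),
            DataTypes.createStructField("itemcount", DataTypes.IntegerType, true)));
    }

    // Example cleaning rules (assumptions): trim the id, clamp negatives to 0.
    // Assumes the input fields arrive in the same order as dataType().
    public static Row clean(Row in) {
        if (in == null) {
            return null;
        }
        String companyId = in.isNullAt(0) ? null : in.getString(0).trim();
        // The returned Row must match the declared return type field by field.
        return RowFactory.create(
            companyId, nonNegative(in, 1), nonNegative(in, 2), nonNegative(in, 3));
    }

    // Reads a numeric field as an int (tolerating long input) and clamps it at 0.
    private static Integer nonNegative(Row row, int index) {
        if (row.isNullAt(index)) {
            return null;
        }
        int value = ((Number) row.get(index)).intValue();
        return Math.max(0, value);
    }

    // A struct column reaches a Java UDF as a Row, so the UDF maps Row to Row.
    public static UDF1<Row, Row> cleanDataField() {
        return new UDF1<Row, Row>() {
            @Override
            public Row call(Row data) {
                return clean(data);
            }
        };
    }

    // The third argument is the struct type instead of DataTypes.StringType.
    public static void register(SQLContext sqlContext) {
        sqlContext.udf().register("cleanDataField", cleanDataField(), dataType());
    }
}
```

After calling CleanDataFieldUdf.register(sqlContext), the call from the question, df.selectExpr("cleanDataField(data)"), should yield a struct column rather than a string.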

Answer 2

I don't know whether your question is still relevant, but just in case, here is the answer:

You need to replace the third argument with set

How to return complex types using Spark UDFs
