Question

Thanks in advance.

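Hello, I am using Spark DataFrames with Scala for some data processing. I need to read multiple columns that share the same data type (a struct type, in my case) from a Parquet file, and create a new DataFrame whose schema matches the struct's fields, i.e. field1, field2 and field3, populated with the data from all of those columns, as shown below.

For example, suppose I have 3 columns: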

a)column1: struct (nullable = true)
     |-- field1: string (nullable = true)
     |-- field2: string (nullable = true)
     |-- field3: string (nullable = true)

b)column2: struct (nullable = true)
     |-- field1: string (nullable = true)
     |-- field2: string (nullable = true)
     |-- field3: string (nullable = true)

c)column3: struct (nullable = true)
     |-- field1: string (nullable = true)
     |-- field2: string (nullable = true)
     |-- field3: string (nullable = true)

I can read all the values in the columns using the code snippet below:

dataframe.select("column1","column2","column3")

The code above returns Row objects:

[[column1field1,column1field2,column1field3],null,null]
[null,[column2field1,column2field2,column2field3],null]
[null,null,[column3field1,column3field2,column3field3]]
[[column1field1,column1field2,some record, with multiple,separator],null,null]

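The problem is that I read the values out of the Row objects by splitting on the "," separator and use them to populate a DataFrame with 3 fields. Because the fields are strings, some records in the Parquet file contain "," inside the string data itself, as shown in the last Row object above. This breaks the DataFrame schema: since I use "," as the separator to retrieve the Row object's values, I get more than 3 fields. How can I get rid of this error? Is there any provision in Spark to change the separator used for a Row's array values so that this can be fixed?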

Answer 1

Yes, you can load with a different delimiter, for example:

sqlContext.load("com.databricks.spark.csv", yourSchema, Map("path" -> yourDataPath, "header" -> "false", "delimiter" -> "^"))

OR

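sqlContext.read.format("com.databricks.spark.csv").schema(yourSchema).options(Map("path" -> yourDataPath, "header" -> "false", "delimiter" -> "^")).load()

depending on the Spark version you are using.

As for the delimiters inside the strings, you need to either remove them before loading with the ',' delimiter, or use a different delimiter.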
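
If the goal is only to get field1, field2 and field3 out of the struct columns, the values may not need to pass through a delimited string at all, so embedded commas would not matter. The following is a minimal sketch, not a definitive implementation. It assumes Spark 2.x or later and that each row has at most one non-null struct among column1, column2 and column3, as in the sample rows of the question. The names `StructColumnsExample`, `flattenStructs` and the `merged` alias are hypothetical, and the sample row in `main` is made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit, struct}

object StructColumnsExample {
  // Column names taken from the schema shown in the question.
  val structColumns: Seq[String] = Seq("column1", "column2", "column3")

  // Picks the first non-null struct in each row, then expands its fields
  // (field1, field2, field3) into top-level columns. No string splitting occurs.
  // Assumption: at most one of the struct columns is non-null per row.
  def flattenStructs(df: DataFrame): DataFrame =
    df.select(coalesce(structColumns.map(col): _*).as("merged"))
      .select("merged.*")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("struct-columns")
      .getOrCreate()

    // Hypothetical sample row: column1 is populated, column2 and column3 are null.
    val value = struct(
      lit("column1field1").as("field1"),
      lit("column1field2").as("field2"),
      lit("some record, with multiple,separator").as("field3"))
    val sample = spark.range(1).select(value.as("column1"))
    val structType = sample.schema("column1").dataType
    val df = sample
      .withColumn("column2", lit(null).cast(structType))
      .withColumn("column3", lit(null).cast(structType))

    // field3 keeps its embedded commas because it is never parsed from text.
    flattenStructs(df).show(false)
    spark.stop()
  }
}
```

If individual values are needed on the driver instead, the `Row` API exposes typed accessors such as `row.getStruct(0).getString(2)`, which likewise avoid parsing the printed form of the row.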

Spark DataFrame: Row object delimiter
