spark scala: convert an Array of Struct column to a String column

Asked: 2017-06-02 10:41:04

Tags: arrays json scala apache-spark

I have a column of type `array<struct>` that was inferred from a JSON file. I want to convert the `array<struct>` to a string, so that I can keep this array column in Hive and export it to an RDBMS as a single column.

temp.json

{"properties":{"items":[{"invoicid":{"value":"923659"},"job_id":
{"value":"296160"},"sku_id":
{"value":"312002"}}],"user_id":"6666","zip_code":"666"}}

Processing:

scala> val temp = spark.read.json("s3://check/1/temp1.json")
temp: org.apache.spark.sql.DataFrame = [properties: struct<items:
array<struct<invoicid:struct<value:string>,job_id:struct<value:string>,sku_id:struct<value:string>>>, user_id: string ... 1 more field>]

scala> temp.printSchema
root
 |-- properties: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- invoicid: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- job_id: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- sku_id: struct (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |-- user_id: string (nullable = true)
 |    |-- zip_code: string (nullable = true)


scala> temp.select("properties").show
+--------------------+
|          properties|
+--------------------+
|[WrappedArray([[9...|
+--------------------+


scala> temp.select("properties.items").show
+--------------------+
|               items|
+--------------------+
|[[[923659],[29616...|
+--------------------+


scala> temp.createOrReplaceTempView("tempTable")

scala> spark.sql("select properties.items  from tempTable").show
+--------------------+
|               items|
+--------------------+
|[[[923659],[29616...|
+--------------------+

How can I get a result like the following:

+-----------------------------------------------------------------------------------------+
|items                                                                                    |
+-----------------------------------------------------------------------------------------+
|[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}]|
+-----------------------------------------------------------------------------------------+

That is, the array element values should come through unchanged.

1 Answer:

Answer 1 (score: 7)

`to_json` is the function you are looking for:

import org.apache.spark.sql.functions.{get_json_object, to_json}
import spark.implicits._

val df = spark.read.json(sc.parallelize(Seq("""
  {"properties":{"items":[{"invoicid":{"value":"923659"},"job_id":
  {"value":"296160"},"sku_id":
  {"value":"312002"}}],"user_id":"6666","zip_code":"666"}}""")))


df
  .select(get_json_object(to_json($"properties"), "$.items").alias("items"))
  .show(false)
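A note on why `get_json_object` appears here: in early Spark 2.x versions, `to_json` only accepted struct columns, so the answer serializes the whole `properties` struct and then extracts the `items` field with a JSONPath expression. On later Spark versions where `to_json` also accepts `array<struct>` columns directly (an assumption worth checking against your Spark release), a shorter sketch would be:

```scala
// Sketch, assuming a Spark version whose to_json accepts array<struct> input.
import org.apache.spark.sql.functions.to_json

df
  .select(to_json($"properties.items").alias("items"))
  .show(false)
```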