Question

I have a Spark DataFrame in which one column is of type Vector. When I create a Hive table on top of it, I don't know which Hive type it is equivalent to.

CREATE EXTERNAL TABLE mix (
        topicdist ARRAY<DOUBLE>
    )
STORED AS PARQUET
LOCATION 's3://path/to/file.parquet'

Table creation appears to work and returns OK, but when I run

select topicdist from mix limit 1

I get the error:

Failed with exception java.io.IOException:java.lang.RuntimeException: Unknown hive type info array<double> when searching for field type

Answer 1

Vector is a Spark user-defined type (UDT). Internally it is stored as

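import org.apache.spark.sql.types._

// SQL representation of a Vector; "type" is a byte tag, which maps to Hive's tinyint
StructType(Seq(
  StructField("type", ByteType, true),
  StructField("size", IntegerType, true),
  StructField("indices", ArrayType(IntegerType, true), true),
  StructField("values", ArrayType(DoubleType, true), true)
))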

So you need:

CREATE EXTERNAL TABLE mix (
  topicdist struct<type:tinyint,size:int,indices:array<int>,values:array<double>>
)
STORED AS PARQUET
LOCATION 's3://path/to/file.parquet'

Note that the resulting column will not be interpreted as a Spark Vector; Hive sees it as a plain struct.
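
Since Hive only sees a struct, anything reading the table has to decode the fields itself. Below is a minimal sketch in plain Python of how one row of that struct could be expanded into a dense list. It assumes the tag convention used by Spark's Vector serialization (type 0 = sparse, with size, indices and values set; type 1 = dense, with only values set and size/indices null). The function name to_dense is hypothetical and the sample rows are made up for illustration.

```python
from typing import List, Optional, Sequence

# Assumed tag values of the "type" field (0 = sparse, 1 = dense).
SPARSE, DENSE = 0, 1


def to_dense(vec_type: int,
             size: Optional[int],
             indices: Optional[Sequence[int]],
             values: Sequence[float]) -> List[float]:
    """Expand one struct row (type, size, indices, values) into a dense list."""
    if vec_type == DENSE:
        # Dense rows carry every element in `values`; size and indices are null.
        return list(values)
    if vec_type == SPARSE:
        if size is None or indices is None:
            raise ValueError("sparse vector needs size and indices")
        dense = [0.0] * size
        # Each stored value goes to the position given by the matching index.
        for i, v in zip(indices, values):
            dense[i] = v
        return dense
    raise ValueError(f"unknown vector type tag: {vec_type}")


print(to_dense(DENSE, None, None, [0.2, 0.8]))
print(to_dense(SPARSE, 4, [1, 3], [0.5, 0.5]))
```

If the goal is simply an ARRAY<DOUBLE> column in Hive, an alternative is to convert the Vector to an array on the Spark side before writing the Parquet files, so that the original table definition matches the data.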

Hive equivalent of Spark Vector at table creation