Question

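I found plenty of answers for the reverse direction, but none for this one. It may sound silly, because Elasticsearch works purely with denormalized data, but this is the problem we have. We have a table formatted like this: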

+----+--------+--------+--------+--------+---------+
| id | attr_1 | attr_2 | attr_3 | attr_4 | fst_nm  |
+----+--------+--------+--------+--------+---------+
|  1 |   2984 |   0324 |  38432 |        | john    |
|  2 |   2343 |  28347 | 238493 |  34923 | patrick |
|  3 |   3293 |   3823 |  38423 |  34823 | george  |
+----+--------+--------+--------+--------+---------+

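Here every attr_x column represents exactly the same thing: say they were foreign keys to another table back when this data lived in a normalized schema, so all the attrs existed in a separate table. The tables were then denormalized and everything was dumped into one long table. Normally that is not much of a problem to load into Elasticsearch, but this table is massive, with 1000+ columns. We want to take these attrs and store them as an array in Elasticsearch, like this: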

_source: {
  "id": 1,
  "fst_nm": "john",
  "attrs": [
    2984,
    0324,
    38432
  ]
}

Instead of:

_source: {
  "id": 1,
  "fst_nm": "john",
  "attr_1": 2984,
  "attr_2": 0324,
  "attr_3": 38432
}

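When we use the default Spark process, it only creates the second kind of Elasticsearch document (one field per attr column). One idea I had was to create a new table of attrs by unpivoting them, and then query that table by id to get the attrs, so it would look like this: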

+-----+--------+
| id  |  attr  |
+-----+--------+
|   1 |   2984 |
|   1 |   0324 |
|   1 |  38432 |
|   2 |   2343 |
| ... |    ... |
|   3 |  34823 |
+-----+--------+

We could then use Spark SQL to query this newly created table by id and get the attrs, but how would we insert them into Elasticsearch as an array using Spark?
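For the unpivoted layout, one possible way to get back to one array per id is to group by id and aggregate with collect_list. This is only a minimal sketch under assumptions: the local SparkSession, the DataFrame name attrsLong, and the sample rows are hypothetical stand-ins for the real table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

// Local session for illustration only; a real job would reuse its own session.
val spark = SparkSession.builder().master("local[*]").appName("attrs-sketch").getOrCreate()
import spark.implicits._

// Hypothetical long-format table: one row per (id, attr) pair.
val attrsLong = Seq((1, 2984), (1, 324), (1, 38432), (2, 2343)).toDF("id", "attr")

// Gather every attr value of an id into a single array column named "attrs".
// collect_list skips nulls and does not guarantee element order.
val attrsById = attrsLong.groupBy("id").agg(collect_list("attr").as("attrs"))
```

The resulting attrsById has one row per id with an array column, which can be joined back to the remaining columns (such as fst_nm) on id before indexing.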

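Another idea I had was to create a new table in Hive and change the attrs into a Hive complex type (an array), but I don't know how I would do that. Also, if we use Spark to query that Hive table and the results come back as an array, is it easy to dump them into Elasticsearch?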

Answer 1

For the data transformation part, you can use array to collect multiple columns into a single array column, and then use .write.json("jsonfile") to write the result to a JSON file:

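import org.apache.spark.sql.functions.{array, col}

// Select every column whose name starts with "attr" (df is the wide source table)
val attrs = df.columns.filter(_.startsWith("attr")).map(col(_))

// Pack those columns into one array column and keep only the fields to index
val df_array = df.withColumn("attrs", array(attrs:_*)).select("id", "fst_nm", "attrs")

// Inspect the rows as JSON strings
df_array.toJSON.collect
//res8: Array[String] = Array({"id":1,"fst_nm":"john","attrs":[2984,324,38432,null]}, {"id":2,"fst_nm":"patrick","attrs":[2343,28347,238493,34923]})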
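The output above still carries a null for the empty attr_4, and the question asks about Elasticsearch rather than a JSON file. The following is a hedged sketch of one way to handle both, assuming Spark 2.4 or later (for the SQL filter higher-order function) and the elasticsearch-hadoop connector on the classpath (for saveToEs). The local session, the sample rows, the helper name writeToEs, and the "people/doc" resource are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, expr}
import org.elasticsearch.spark.sql._ // elasticsearch-hadoop: adds saveToEs to DataFrames

// Local session and sample rows for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("attrs-to-es").getOrCreate()
import spark.implicits._

val df = Seq[(Int, Option[Int], Option[Int], Option[Int], Option[Int], String)](
  (1, Some(2984), Some(324), Some(38432), None, "john"),
  (2, Some(2343), Some(28347), Some(238493), Some(34923), "patrick")
).toDF("id", "attr_1", "attr_2", "attr_3", "attr_4", "fst_nm")

// Same transformation as in the answer: pack all attr_* columns into one array.
val attrCols = df.columns.filter(_.startsWith("attr")).map(col(_))
val df_array = df.withColumn("attrs", array(attrCols: _*)).select("id", "fst_nm", "attrs")

// Drop the null slots left behind by empty attr columns.
val df_clean = df_array.withColumn("attrs", expr("filter(attrs, x -> x IS NOT NULL)"))

// Hypothetical helper: index each row as one document, using "id" as the document id.
// Connection settings such as es.nodes are assumed to be set on the Spark configuration.
def writeToEs(data: DataFrame, resource: String): Unit =
  data.saveToEs(resource, Map("es.mapping.id" -> "id"))
```

With a reachable cluster, calling writeToEs(df_clean, "people/doc") would index the rows with attrs stored as an array field. Because the attr columns are integers here, a value written as 0324 in the table is stored as 324; if leading zeros matter, the columns would need to be kept as strings.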

Load a denormalized Hive table into Elasticsearch with Spark
