Question

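Apache Drill's feature list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do this. I already have it working fine with S3, but I suspect I'm missing something very simple for Google Cloud Storage.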

Does anyone have an example storage plugin configuration for Google Cloud Storage?

Thanks

M

Answer 1

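This is quite an old question, so I imagine you either found a solution or moved on with your life, but for anyone looking for a solution that doesn't use Dataproc, here is one: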

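Add the JAR file from the GCP connector to the jars/3rdparty directory.
Add the following to the core-site.xml file in the conf directory (change the upper-case values, such as YOUR_PROJECT_ID, to your own details):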

<property>
    <name>fs.gs.project.id</name>
    <value>YOUR_PROJECT_ID</value>
    <description>
      Optional. Google Cloud Project ID with access to GCS buckets.
      Required only for list buckets and create bucket operations.
    </description>
  </property>
  <property>
    <name>fs.gs.auth.service.account.private.key.id</name>
    <value>YOUR_PRIVATE_KEY_ID</value>
  </property>
    <property>
        <name>fs.gs.auth.service.account.private.key</name>
        <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
    </property>
  <property>
    <name>fs.gs.auth.service.account.email</name>
    <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
    <description>
      The email address is associated with the service account used for GCS
      access when fs.gs.auth.service.account.enable is true. Required
      when authentication key specified in the Configuration file (Method 1)
      or a PKCS12 certificate (Method 3) is being used.
    </description>
  </property>
  <property>
    <name>fs.gs.working.dir</name>
    <value>/</value>
    <description>
      The directory relative gs: uris resolve in inside of the default bucket.
    </description>
  </property>
   <property>
    <name>fs.gs.implicit.dir.repair.enable</name>
    <value>true</value>
    <description>
      Whether or not to create objects for the parent directories of objects
      with / in their path e.g. creating gs://bucket/foo/ upon deleting or
      renaming gs://bucket/foo/bar.
    </description>
  </property>
   <property>
    <name>fs.gs.glob.flatlist.enable</name>
    <value>true</value>
    <description>
      Whether or not to prepopulate potential glob matches in a single list
      request to minimize calls to GCS in nested glob cases.
    </description>
  </property>
   <property>
    <name>fs.gs.copy.with.rewrite.enable</name>
    <value>true</value>
    <description>
      Whether or not to perform copy operation using Rewrite requests. Allows
      to copy files between different locations and storage classes.
    </description>
  </property>
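
The first step above (copying the connector JAR into place) might look like the following sketch. The function name and both arguments are placeholders of my own, not anything Drill or the connector defines; pass your actual Drill install directory and the path of the JAR you downloaded.

```shell
#!/bin/sh
# Sketch only: copy the GCS connector JAR into Drill's third-party JAR directory.
# Both arguments are assumptions; adjust them to your own setup.
install_gcs_connector() {
    drill_home="$1"      # Drill install directory (hypothetical path)
    connector_jar="$2"   # downloaded GCS connector JAR (hypothetical path)

    if [ ! -f "$connector_jar" ]; then
        echo "connector JAR not found: $connector_jar" >&2
        return 1
    fi

    # Create jars/3rdparty if it is missing, then copy the JAR into it.
    mkdir -p "$drill_home/jars/3rdparty"
    cp "$connector_jar" "$drill_home/jars/3rdparty/"
}
```

For example, `install_gcs_connector /opt/apache-drill /path/to/gcs-connector.jar`, run before starting Drill.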

Start Apache Drill.

Add a custom storage plugin to Drill.
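
If you prefer not to click through the web console, a storage plugin can probably also be registered over Drill's REST interface. This is a minimal sketch, assuming a Drill version that exposes the `/storage/{name}.json` endpoint and a Drillbit reachable on the default web port 8047; the plugin name `gcs`, the bucket `my-bucket`, and both function names are my own placeholders.

```shell
#!/bin/sh
# Sketch only: build a minimal GCS storage plugin definition and POST it to Drill.
# The plugin name, bucket name, and Drill URL are assumptions supplied by the caller.

# Print the JSON body for a file-type storage plugin that points at a GCS bucket.
gcs_plugin_payload() {
    plugin_name="$1"
    bucket="$2"
    cat <<EOF
{
  "name": "$plugin_name",
  "config": {
    "type": "file",
    "enabled": true,
    "connection": "gs://$bucket",
    "workspaces": {
      "root": { "location": "/", "writable": false, "defaultInputFormat": null }
    },
    "formats": {
      "parquet": { "type": "parquet" }
    }
  }
}
EOF
}

# Send the payload to a running Drillbit (requires curl and a reachable Drill).
register_gcs_plugin() {
    drill_url="$1"
    plugin_name="$2"
    bucket="$3"
    gcs_plugin_payload "$plugin_name" "$bucket" |
        curl -sS -X POST -H "Content-Type: application/json" \
             --data @- "$drill_url/storage/$plugin_name.json"
}
```

Usage would be something like `register_gcs_plugin http://localhost:8047 gcs my-bucket`. Only the parquet format is declared here to keep the sketch short; add other formats as needed.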

You're good to go.

This solution comes from here, where I go into detail about how we use Apache Drill for data exploration.

Answer 2

I managed to query Parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. To do so, I took the following steps:

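Install Drill and make the GCS connector accessible (this can be used as an init script for Dataproc; just note that it hasn't been tested and relies on a local ZooKeeper instance):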

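#!/bin/sh
# Untested sketch of a Dataproc init script; relies on a local ZooKeeper instance.
set -x -e
BASEDIR="/opt/apache-drill-1.6.0"
mkdir -p "${BASEDIR}"
cd "${BASEDIR}"

# Download Drill and unpack it directly into BASEDIR.
wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
tar -xzvf apache-drill-1.6.0.tar.gz
mv apache-drill-1.6.0/* .
rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz

# Reuse the GCS connector and Hadoop configuration that Dataproc already provides.
ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar "${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar"
mv "${BASEDIR}/conf/core-site.xml" "${BASEDIR}/conf/core-site.xml.old"
ln -s /etc/hadoop/conf/core-site.xml "${BASEDIR}/conf/core-site.xml"

# drillbit.sh lives in the bin directory of the Drill distribution.
"${BASEDIR}/bin/drillbit.sh" start

set +x +e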

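Connect to the Drill console, create a new storage plugin (call it gcs), and use the following configuration (note that I copied most of it from the s3 configuration and made minor changes):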

{
  "type": "file",
  "enabled": true,
  "connection": "gs://myBucketName",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}

Query using the following syntax (note the backticks; the first part of the name is the storage plugin name, gcs):
```
select * from gcs.`root`.`path/to/data/*` limit 10;
```
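
The same kind of query can likely be sent from a script through Drill's REST interface instead of the console. A minimal sketch, assuming the `/query.json` endpoint is available in your Drill version and that the SQL text contains no double quotes or backslashes (the helper does no JSON escaping); the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: wrap a SQL string in the JSON body used by Drill's REST query endpoint.
# Assumes the SQL contains no double quotes or backslashes (no JSON escaping is done).
drill_query_payload() {
    printf '{"queryType": "SQL", "query": "%s"}\n' "$1"
}

# POST the query to a running Drillbit (requires curl and a reachable Drill).
run_drill_query() {
    drill_url="$1"
    sql="$2"
    drill_query_payload "$sql" |
        curl -sS -X POST -H "Content-Type: application/json" \
             --data @- "$drill_url/query.json"
}
```

For example: ``run_drill_query http://localhost:8047 'select * from gcs.`root`.`path/to/data/*` limit 10'``, where `gcs` stands for whatever name the storage plugin was registered under. The single quotes keep the shell from interpreting the backticks.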
