Connecting Apache Drill to Google Cloud

Date: 2017-01-11 05:27:56

Tags: google-cloud-storage apache-drill

How do I connect Google Cloud Storage buckets to Apache Drill? I want to connect Apache Drill to Google Cloud Storage buckets and query data from the files stored in those buckets.

I can specify an access ID and key in core-site.xml in order to connect to AWS. Is there a similar way to connect Drill to Google Cloud?
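For reference, the AWS setup mentioned above typically looks something like the following in core-site.xml (a sketch using the standard Hadoop s3a property names; the placeholder values are examples):

<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>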

2 answers:

Answer 0: (score: 1)

I know this question is old, but here is a way to do it without using Dataproc.

Add the JAR file from the GCS connector to the jars/3rdparty directory (a sketch of this step is shown after the config below). Then add the following to the core-site.xml file in the conf directory, changing the uppercase values (e.g. YOUR_PROJECT_ID) to your own details:

<property>
  <name>fs.gs.project.id</name>
  <value>YOUR_PROJECT_ID</value>
  <description>
    Optional. Google Cloud Project ID with access to GCS buckets.
    Required only for list buckets and create bucket operations.
  </description>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key.id</name>
  <value>YOUR_PRIVATE_KEY_ID</value>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key</name>
  <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
  <description>
    The email address is associated with the service account used for GCS
    access when fs.gs.auth.service.account.enable is true. Required
    when authentication key specified in the Configuration file (Method 1)
    or a PKCS12 certificate (Method 3) is being used.
  </description>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
  <description>
    The directory relative gs: uris resolve in inside of the default bucket.
  </description>
</property>
<property>
  <name>fs.gs.implicit.dir.repair.enable</name>
  <value>true</value>
  <description>
    Whether or not to create objects for the parent directories of objects
    with / in their path e.g. creating gs://bucket/foo/ upon deleting or
    renaming gs://bucket/foo/bar.
  </description>
</property>
<property>
  <name>fs.gs.glob.flatlist.enable</name>
  <value>true</value>
  <description>
    Whether or not to prepopulate potential glob matches in a single list
    request to minimize calls to GCS in nested glob cases.
  </description>
</property>
<property>
  <name>fs.gs.copy.with.rewrite.enable</name>
  <value>true</value>
  <description>
    Whether or not to perform copy operation using Rewrite requests. Allows
    to copy files between different locations and storage classes.
  </description>
</property>
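A minimal sketch of the JAR step mentioned above, assuming a standalone tarball install with DRILL_HOME set and a GCS connector JAR already downloaded (the filename and path are examples):

# Copy the GCS connector JAR into Drill's third-party jars directory
cp /path/to/gcs-connector-hadoop2-latest.jar $DRILL_HOME/jars/3rdparty/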

Start Apache Drill.

Add a custom storage plugin to Drill.

You should be good to go.

The solution came from here, where I go into more detail about how we use Apache Drill for data exploration.

Answer 1: (score: 0)

I found the answer here useful: Apache Drill using Google Cloud Storage

On Google Cloud Dataproc you can set this up with an initialization action, as in the answer above. There is also a complete one you can use that creates a GCS plugin for you, pointed by default at the ephemeral bucket created with your Dataproc cluster.

If you are not using Cloud Dataproc, you can do the following on your already-installed Drill cluster.

Get the GCS connector from somewhere and put it in Drill's 3rdparty jars directory. GCS configuration is detailed at the link above. On Dataproc the connector jar is in /usr/lib/hadoop, so the initialization action above does this:

# Link GCS connector to drill jars
ln -sf /usr/lib/hadoop/lib/gcs-connector-1.6.0-hadoop2.jar $DRILL_HOME/jars/3rdparty

You also need to configure core-site.xml and make it available to Drill. This is necessary so that Drill knows how to connect to GCS.

# Symlink core-site.xml to $DRILL_HOME/conf
ln -sf /etc/hadoop/conf/core-site.xml $DRILL_HOME/conf

Start or restart your drillbits as needed.
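For example, on a tarball install this is a minimal sketch (assuming DRILL_HOME is set; run it on each drillbit host):

# Restart the local drillbit so it picks up the connector jar and core-site.xml
$DRILL_HOME/bin/drillbit.sh restart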

Once Drill is up, you can create a new plugin that points to a GCS bucket. First write out a JSON file containing the plugin configuration:

export DATAPROC_BUCKET=gs://your-bucket-name
cat > /tmp/gcs_plugin.json <<EOF
{
    "config": {
        "connection": "$DATAPROC_BUCKET",
        "enabled": true,
        "formats": {
            "avro": {
                "type": "avro"
            },
            "csv": {
                "delimiter": ",",
                "extensions": [
                    "csv"
                ],
                "type": "text"
            },
            "csvh": {
                "delimiter": ",",
                "extensions": [
                    "csvh"
                ],
                "extractHeader": true,
                "type": "text"
            },
            "json": {
                "extensions": [
                    "json"
                ],
                "type": "json"
            },
            "parquet": {
                "type": "parquet"
            },
            "psv": {
                "delimiter": "|",
                "extensions": [
                    "tbl"
                ],
                "type": "text"
            },
            "sequencefile": {
                "extensions": [
                    "seq"
                ],
                "type": "sequencefile"
            },
            "tsv": {
                "delimiter": "\t",
                "extensions": [
                    "tsv"
                ],
                "type": "text"
            }
        },
        "type": "file",
        "workspaces": {
            "root": {
                "defaultInputFormat": null,
                "location": "/",
                "writable": false
            },
            "tmp": {
                "defaultInputFormat": null,
                "location": "/tmp",
                "writable": true
            }
        }
    },
    "name": "gs"
}
EOF

Then POST the new plugin to any drillbit (I assume you are running this on one of the drillbits):

curl -d@/tmp/gcs_plugin.json \
  -H "Content-Type: application/json" \
  -X POST http://localhost:8047/storage/gs.json
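As a quick sanity check (a sketch, assuming Drill's web/REST API on its default port 8047), you can fetch the stored definition back:

# Should return the plugin configuration that was just registered
curl http://localhost:8047/storage/gs.json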

If you want Drill to query multiple buckets, I believe you need to repeat this process with a different plugin name ("gs" above); a sketch of that is shown below.
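A minimal sketch for a hypothetical second bucket (the plugin name gs2 and the bucket name are examples), reusing the JSON file written above:

# Derive a second plugin config from the first, pointing at another bucket
export SECOND_BUCKET=gs://your-other-bucket
sed -e 's|"name": "gs"|"name": "gs2"|' \
    -e "s|$DATAPROC_BUCKET|$SECOND_BUCKET|" \
    /tmp/gcs_plugin.json > /tmp/gcs2_plugin.json

# Register it under the new name
curl -d@/tmp/gcs2_plugin.json \
  -H "Content-Type: application/json" \
  -X POST http://localhost:8047/storage/gs2.json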

Then you can launch sqlline and check that you can query files in that bucket.
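A minimal sketch of that check (the ZooKeeper connect string and the CSV path are examples; adjust them for your cluster and data):

# Connect to a drillbit with sqlline
$DRILL_HOME/bin/sqlline -u jdbc:drill:zk=localhost:2181
# then, at the sqlline prompt, run something like:
#   SELECT * FROM gs.root.`path/to/file.csv` LIMIT 10;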