Connecting Apache Drill to Google Cloud

Date: 2017-01-11 05:27:56

Tags: google-cloud-storage apache-drill

How do I connect Google Cloud Storage buckets to Apache Drill? I want to connect Apache Drill to Google Cloud Storage buckets and query data from the files stored in those buckets.

I can specify an access ID and key in core-site.xml in order to connect to AWS. Is there a similar way to connect Drill to Google Cloud?
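For reference, the AWS setup mentioned above typically looks something like the following in core-site.xml (a sketch using the standard Hadoop s3a property names; the placeholder values are examples):

<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>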

2 answers:

Answer 0: (score: 1)

I know this question is old, but here is a way to do it without using Dataproc.

Add the JAR file from the GCS connector to the jars/3rdparty directory (a sketch of this step is shown after the config below). Then add the following to the core-site.xml file in the conf directory, changing the uppercase values (e.g. YOUR_PROJECT_ID) to your own details:

<property>
  <name>fs.gs.project.id</name>
  <value>YOUR_PROJECT_ID</value>
  <description>
    Optional. Google Cloud Project ID with access to GCS buckets.
    Required only for list buckets and create bucket operations.
  </description>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key.id</name>
  <value>YOUR_PRIVATE_KEY_ID</value>
</property>
<property>
  <name>fs.gs.auth.service.account.private.key</name>
  <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
  <description>
    The email address is associated with the service account used for GCS
    access when fs.gs.auth.service.account.enable is true. Required
    when authentication key specified in the Configuration file (Method 1)
    or a PKCS12 certificate (Method 3) is being used.
  </description>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
  <description>
    The directory relative gs: uris resolve in inside of the default bucket.
  </description>
</property>
<property>
  <name>fs.gs.implicit.dir.repair.enable</name>
  <value>true</value>
  <description>
    Whether or not to create objects for the parent directories of objects
    with / in their path e.g. creating gs://bucket/foo/ upon deleting or
    renaming gs://bucket/foo/bar.
  </description>
</property>
<property>
  <name>fs.gs.glob.flatlist.enable</name>
  <value>true</value>
  <description>
    Whether or not to prepopulate potential glob matches in a single list
    request to minimize calls to GCS in nested glob cases.
  </description>
</property>
<property>
  <name>fs.gs.copy.with.rewrite.enable</name>
  <value>true</value>
  <description>
    Whether or not to perform copy operation using Rewrite requests. Allows
    to copy files between different locations and storage classes.
  </description>
</property>
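A minimal sketch of the JAR step mentioned above, assuming a standalone tarball install with DRILL_HOME set and a GCS connector JAR already downloaded (the filename and path are examples):

# Copy the GCS connector JAR into Drill's third-party jars directory
cp /path/to/gcs-connector-hadoop2-latest.jar $DRILL_HOME/jars/3rdparty/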

Start Apache Drill.

Add a custom storage plugin to Drill.

You should be good to go.

The solution came from here, where I go into more detail about how we use Apache Drill for data exploration.

Answer 1: (score: 0)

I found the answer here useful: Apache Drill using Google Cloud Storage

On Google Cloud Dataproc you can set this up with an initialization action, as in the answer above. There is also a complete one you can use that creates a GCS plugin for you, pointed by default at the ephemeral bucket created with your Dataproc cluster.

If you are not using Cloud Dataproc, you can do the following on your already-installed Drill cluster.

Get the GCS connector from somewhere and put it in Drill's 3rdparty jars directory. GCS configuration is detailed at the link above. On Dataproc the connector jar is in /usr/lib/hadoop, so the initialization action above does this:

# Link GCS connector to drill jars
ln -sf /usr/lib/hadoop/lib/gcs-connector-1.6.0-hadoop2.jar $DRILL_HOME/jars/3rdparty

You also need to configure core-site.xml and make it available to Drill. This is necessary so that Drill knows how to connect to GCS.

# Symlink core-site.xml to $DRILL_HOME/conf
ln -sf /etc/hadoop/conf/core-site.xml $DRILL_HOME/conf

Start or restart your drillbits as needed.
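For example, on a tarball install this is a minimal sketch (assuming DRILL_HOME is set; run it on each drillbit host):

# Restart the local drillbit so it picks up the connector jar and core-site.xml
$DRILL_HOME/bin/drillbit.sh restart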

Once Drill is up, you can create a new plugin that points to a GCS bucket. First write out a JSON file containing the plugin configuration:

export DATAPROC_BUCKET=gs://your-bucket-name
cat > /tmp/gcs_plugin.json <<EOF
{
    "config": {
        "connection": "$DATAPROC_BUCKET",
        "enabled": true,
        "formats": {
            "avro": {
                "type": "avro"
            },
            "csv": {
                "delimiter": ",",
                "extensions": [
                    "csv"
                ],
                "type": "text"
            },
            "csvh": {
                "delimiter": ",",
                "extensions": [
                    "csvh"
                ],
                "extractHeader": true,
                "type": "text"
            },
            "json": {
                "extensions": [
                    "json"
                ],
                "type": "json"
            },
            "parquet": {
                "type": "parquet"
            },
            "psv": {
                "delimiter": "|",
                "extensions": [
                    "tbl"
                ],
                "type": "text"
            },
            "sequencefile": {
                "extensions": [
                    "seq"
                ],
                "type": "sequencefile"
            },
            "tsv": {
                "delimiter": "\t",
                "extensions": [
                    "tsv"
                ],
                "type": "text"
            }
        },
        "type": "file",
        "workspaces": {
            "root": {
                "defaultInputFormat": null,
                "location": "/",
                "writable": false
            },
            "tmp": {
                "defaultInputFormat": null,
                "location": "/tmp",
                "writable": true
            }
        }
    },
    "name": "gs"
}
EOF

Then POST the new plugin to any drillbit (I assume you are running this on one of the drillbits):

curl -d@/tmp/gcs_plugin.json \
  -H "Content-Type: application/json" \
  -X POST http://localhost:8047/storage/gs.json
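As a quick sanity check (a sketch, assuming Drill's web/REST API on its default port 8047), you can fetch the stored definition back:

# Should return the plugin configuration that was just registered
curl http://localhost:8047/storage/gs.json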

If you want Drill to query multiple buckets, I believe you need to repeat this process with a different plugin name ("gs" above); a sketch of that is shown below.
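A minimal sketch for a hypothetical second bucket (the plugin name gs2 and the bucket name are examples), reusing the JSON file written above:

# Derive a second plugin config from the first, pointing at another bucket
export SECOND_BUCKET=gs://your-other-bucket
sed -e 's|"name": "gs"|"name": "gs2"|' \
    -e "s|$DATAPROC_BUCKET|$SECOND_BUCKET|" \
    /tmp/gcs_plugin.json > /tmp/gcs2_plugin.json

# Register it under the new name
curl -d@/tmp/gcs2_plugin.json \
  -H "Content-Type: application/json" \
  -X POST http://localhost:8047/storage/gs2.json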

Then you can launch sqlline and check that you can query files in that bucket.
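A minimal sketch of that check (the ZooKeeper connect string and the CSV path are examples; adjust them for your cluster and data):

# Connect to a drillbit with sqlline
$DRILL_HOME/bin/sqlline -u jdbc:drill:zk=localhost:2181
# then, at the sqlline prompt, run something like:
#   SELECT * FROM gs.root.`path/to/file.csv` LIMIT 10;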