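How do I connect a Google Cloud Storage bucket to Apache Drill? I want to connect Apache Drill to Google Cloud Storage buckets and fetch data from the files stored in those buckets.
For AWS, I can specify the access ID and secret key in core-site.xml to connect. Is there a similar way to connect Drill to Google Cloud?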
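Answer 0 (score: 1)
I know this question is quite old, but here is a way to do it without using Dataproc.
Add the JAR file from the GCP connector to the jars/3rdparty directory. Then add the following to the core-site.xml file in the conf directory, inside its <configuration> element, replacing the upper-case values (such as YOUR_PROJECT_ID) with your own details: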
<property>
<name>fs.gs.project.id</name>
<value>YOUR_PROJECT_ID</value>
<description>
Optional. Google Cloud Project ID with access to GCS buckets.
Required only for list buckets and create bucket operations.
</description>
</property>
<property>
<name>fs.gs.auth.service.account.private.key.id</name>
<value>YOUR_PRIVATE_KEY_ID</value>
</property>
<property>
<name>fs.gs.auth.service.account.private.key</name>
<value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
</property>
<property>
<name>fs.gs.auth.service.account.email</name>
<value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
<description>
The email address is associated with the service account used for GCS
access when fs.gs.auth.service.account.enable is true. Required
when authentication key specified in the Configuration file (Method 1)
or a PKCS12 certificate (Method 3) is being used.
</description>
</property>
<property>
<name>fs.gs.working.dir</name>
<value>/</value>
<description>
The directory relative gs: uris resolve in inside of the default bucket.
</description>
</property>
<property>
<name>fs.gs.implicit.dir.repair.enable</name>
<value>true</value>
<description>
Whether or not to create objects for the parent directories of objects
with / in their path e.g. creating gs://bucket/foo/ upon deleting or
renaming gs://bucket/foo/bar.
</description>
</property>
<property>
<name>fs.gs.glob.flatlist.enable</name>
<value>true</value>
<description>
Whether or not to prepopulate potential glob matches in a single list
request to minimize calls to GCS in nested glob cases.
</description>
</property>
<property>
<name>fs.gs.copy.with.rewrite.enable</name>
<value>true</value>
<description>
Whether or not to perform copy operation using Rewrite requests. Allows
to copy files between different locations and storage classes.
</description>
</property>
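Start Apache Drill.
Add a custom storage plugin to Drill.
You're good to go.
This solution comes from here, where I go into more detail about how we use Apache Drill for data exploration.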
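Before starting Drill, it can help to confirm that the two pieces described above are in place. The following is a minimal sketch, not part of Drill: check_gcs_setup is a hypothetical helper, and the gcs-connector*.jar file name pattern is an assumption based on the usual connector naming.

```shell
# Sketch: sanity-check a Drill install for the GCS setup described above.
# check_gcs_setup is a hypothetical helper name, not something Drill ships.
check_gcs_setup() {
  drill_home=$1
  status=0

  # The connector JAR should be in jars/3rdparty (name pattern is an assumption).
  if ! ls "$drill_home"/jars/3rdparty/gcs-connector*.jar >/dev/null 2>&1; then
    echo "missing: GCS connector JAR in $drill_home/jars/3rdparty"
    status=1
  fi

  # core-site.xml must exist and name each service-account property.
  site="$drill_home/conf/core-site.xml"
  if [ ! -f "$site" ]; then
    echo "missing: $site"
    return 1
  fi
  for prop in fs.gs.auth.service.account.private.key.id \
              fs.gs.auth.service.account.private.key \
              fs.gs.auth.service.account.email; do
    if ! grep -q "<name>$prop</name>" "$site"; then
      echo "missing property: $prop"
      status=1
    fi
  done

  # Leftover YOUR_... placeholders mean the values were never filled in.
  if grep -q 'YOUR_' "$site"; then
    echo "placeholder values (YOUR_...) still present in $site"
    status=1
  fi
  return $status
}

# Usage: check_gcs_setup "$DRILL_HOME" && echo "GCS setup looks complete"
```

The check only looks at file names and property names; it does not verify that the credentials themselves are valid.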
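Answer 1 (score: 0)
I found the answer here useful: Apache Drill using Google Cloud Storage
On Google Cloud Dataproc, you can set this up with an initialization action, as shown in the answer above. There is also a complete one you can use that creates a GCS plugin for you, pointing by default at the ephemeral bucket created with your Dataproc cluster.
If you are not using Cloud Dataproc, you can do the following on an already-installed Drill cluster.
Get the GCS connector from somewhere and put it in Drill's 3rdparty jars directory. GCS configuration is detailed in the link above. On Dataproc, the connector jar is under /usr/lib/hadoop, so the initialization action above does this: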
# Link GCS connector to drill jars
ln -sf /usr/lib/hadoop/lib/gcs-connector-1.6.0-hadoop2.jar $DRILL_HOME/jars/3rdparty
You also need to configure core-site.xml and make it available to Drill. This is necessary so that Drill knows how to connect to GCS.
# Symlink core-site.xml to $DRILL_HOME/conf
ln -sf /etc/hadoop/conf/core-site.xml $DRILL_HOME/conf
Start or restart the drillbits as needed.
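As a rough sketch of the restart step, the wrapper below calls Drill's bin/drillbit.sh script on the current node. It assumes DRILL_HOME points at the Drill install; restart_drillbit is a hypothetical name used only for illustration.

```shell
# Sketch: restart the drillbit on this node, then report its status.
# Assumes DRILL_HOME is set; restart_drillbit is a hypothetical wrapper
# around Drill's own bin/drillbit.sh script.
restart_drillbit() {
  drillbit_sh="${DRILL_HOME:?DRILL_HOME is not set}/bin/drillbit.sh"
  "$drillbit_sh" restart && "$drillbit_sh" status
}

# Usage: restart_drillbit
```

On a multi-node cluster this would need to be run on every node so that each drillbit picks up the new jar and core-site.xml.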
Once Drill is up, you can create a new plugin that points to a GCS bucket. First, write out a JSON file containing the plugin configuration:
export DATAPROC_BUCKET=gs://your-bucket-name
cat > /tmp/gcs_plugin.json <<EOF
{
"config": {
"connection": "$DATAPROC_BUCKET",
"enabled": true,
"formats": {
"avro": {
"type": "avro"
},
"csv": {
"delimiter": ",",
"extensions": [
"csv"
],
"type": "text"
},
"csvh": {
"delimiter": ",",
"extensions": [
"csvh"
],
"extractHeader": true,
"type": "text"
},
"json": {
"extensions": [
"json"
],
"type": "json"
},
"parquet": {
"type": "parquet"
},
"psv": {
"delimiter": "|",
"extensions": [
"tbl"
],
"type": "text"
},
"sequencefile": {
"extensions": [
"seq"
],
"type": "sequencefile"
},
"tsv": {
"delimiter": "\t",
"extensions": [
"tsv"
],
"type": "text"
}
},
"type": "file",
"workspaces": {
"root": {
"defaultInputFormat": null,
"location": "/",
"writable": false
},
"tmp": {
"defaultInputFormat": null,
"location": "/tmp",
"writable": true
}
}
},
"name": "gs"
}
EOF
Then POST the new plugin to any drillbit (I'm assuming you are running this on one of the drillbits):
curl -d@/tmp/gcs_plugin.json \
-H "Content-Type: application/json" \
-X POST http://localhost:8047/storage/gs.json
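To confirm that the POST took effect, the plugin definition can be read back from the same REST endpoint. This is a minimal sketch: DRILL_URL and plugin_enabled are hypothetical names, and the host and port simply mirror the curl call above.

```shell
# Sketch: read the plugin back over REST and check that it is enabled.
# DRILL_URL and plugin_enabled are hypothetical names for illustration.
DRILL_URL=${DRILL_URL:-http://localhost:8047}

plugin_enabled() {
  # GET /storage/<name>.json returns the stored plugin definition as JSON.
  curl -s "$DRILL_URL/storage/$1.json" |
    grep -q '"enabled"[[:space:]]*:[[:space:]]*true'
}

# Usage: plugin_enabled gs && echo "gs plugin is registered and enabled"
```

The grep is a deliberately crude check on the returned JSON; a JSON-aware tool would be more robust if one is available.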
If you want Drill to query multiple buckets, I believe you need to repeat this process with a different plugin name each time ("gs" above).
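If you do register one plugin per bucket, each one needs a distinct name. The sketch below derives a name from the bucket URI; the gs_<bucket> naming scheme and the example bucket names are assumptions, not a Drill convention.

```shell
# Sketch: derive one plugin name per bucket so each bucket gets its own
# storage plugin. The gs_<bucket> naming scheme is an assumption.
plugin_name_for_bucket() {
  # Strip the gs:// scheme, then map characters outside [A-Za-z0-9_] to "_".
  bucket=${1#gs://}
  printf 'gs_%s\n' "$(printf '%s' "$bucket" | tr -c 'A-Za-z0-9_' '_')"
}

# Example bucket names are placeholders.
for bucket in gs://first-bucket gs://second-bucket; do
  echo "$bucket -> $(plugin_name_for_bucket "$bucket")"
done
```

Each derived name would then replace both the "name" field in the JSON file and the final path segment of the POST URL, with "connection" set to the matching bucket.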
Then you can start sqlline and check that you can query the files in that bucket.
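As a non-interactive alternative to sqlline, the same check can be run through Drill's REST query endpoint. This is a sketch under assumptions: drill_query_payload and drill_query are hypothetical helper names, the host and port mirror the curl call above, and the file path in the usage comment is a placeholder.

```shell
# Sketch: run a test query through Drill's REST API instead of sqlline.
# drill_query_payload and drill_query are hypothetical helper names.
drill_query_payload() {
  # Escape backslashes and double quotes so the SQL is a valid JSON string.
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"queryType": "SQL", "query": "%s"}\n' "$escaped"
}

drill_query() {
  # POST the JSON payload to /query.json; curl reads the body from stdin.
  drill_query_payload "$1" |
    curl -s -X POST -H "Content-Type: application/json" \
         -d @- "${DRILL_URL:-http://localhost:8047}/query.json"
}

# Usage (the file path is a placeholder for an object in your bucket):
#   drill_query 'SHOW FILES IN gs.root'
#   drill_query 'SELECT * FROM gs.root.`path/to/file.csv` LIMIT 5'
```

Listing the files in gs.root first is a cheap way to tell a credentials or connector problem apart from a problem with a specific file format.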