如何使用Apache Drill从AmazonS3查询数据?

时间:2015-07-30 09:21:25

标签: hadoop amazon-s3 apache-drill

使用Spark将asAsTextFile保存到S3然后格式化为Hadoop。 像这样的文件格式。

enter image description here

格式为YEAR / MONTH / DATE / TIMESTAMP 并且数据在part-0000文件中是json格式。

我配置钻取并指向我的存储桶名称

{
  "type": "file",
  "enabled": true,
  "connection": "s3://com.giaosudau.win-bid",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": "json"
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    }
  }
}

我选择存储是可以的。但不知道选择并获取日期或小时或文件的数据。

enter image description here

我也尝试过选择文件但不成功的事情:

enter image description here

此处提供文件示例

{"auctionId":"xx","bidRequestString":{"id":"xx","timestamp":"2015-07-29T08:31:00.413Z","isTest":false,"url":"http://www.222.3232.com/","userAgent":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","protocolVersion":"Google Protocol Buffer","exchange":"adx","provider":"Google","location":{"dma":-1,"metro":-1,"timezoneOffsetMinutes":-1},"segments":{"AdxDetectedVerticals":["23:0.571243","355:0.409098","380:0.339647","474:0.415936","540:0.079871"]},"userIds":{"prov":"bGZqaWdsa2dmbmRubWxtamdoaGs","xchg":"bGZqaWdsa2dmbmRubWxtamdoaGs"},"imp":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"spots":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"site":{"id":"502","publisher":{"id":"502"},"page":"http://www.blog.34343.com/"},"device":{"ua":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","ip":"109.103.101.0","geo":{"zip":" 2600"},"language":"en","ext":{"geo_criteria_id":1000142}},"user":{"id":"bGZqaWdsa2dmbmRubWxtamdoaGs","ext":{"cookie_age_seconds":2080089}}},"bidResponseCreative":{"itemId":"fsknxfwe34235235","campaignId":"332","htmlSnippet":"\u003ciframe frameborder=0 scrolling=no width=\"300\" height=\"250\" src=\"//sv.brand-display.com/adedge/api/bd/serving/simple/frame?aukey=34343\u0026_=%CACHEBUSTER%\u0026winning_price=%WINNING_PRICE%\u0026google_click_url=%CLICK_URL_ESC%\u0026encrypt_value=%ENCRYPT_VALUE%\"\u003e\u003c/iframe\u003e","name":"Expandable Web","formatCode":"billboard","bd":"","status":1,"deleted":false,"destinationUrl":"https://ex.sg","tagging":[1,2,3],"expandingDirection":14,"bdUrl":"","format":{"code":"billboard","name":"Billboard","publisher":"default","type":"standard","width":120,"height":600,"expanded_width":0,"expanded_height":0,"collapsed_width":0,"collapsed_height":0,"aspratio":3.76,"expandable":true,"expand_first":true}},"bidResponseCreativeId":"1","bidResponseCreativeName":"","bidResponseData":{"bids":[{"creative":0,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":0},{"creative":1,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":1}]},"bidResponseMeta":"null","bidWinMeta":"null","biddingAgentName":"iface.http","biddingFullAccount":"23848834:strategy","biddingMainAccount":"1212312","biddingMaxPrice":"6071USD/1M","biddingRequestFormatType":"datacratic","biddingSubAccount":"strategy","impIndex":"1","impressionId":"199","pricePriority":"1.000000","rawWinPrice":"100USD/1M","timestamp":"2015-Jul-29 08:31:00.53899","userIds":{"prov":"123232","xchg":"9849839"},"winPrice":"100USD/1M"}
{"auctionId":"343","bidRequestString":{"id":"344","timestamp":"2015-07-28T08:31:00.413Z","isTest":false,"url":"http://www.xx.xx.com/","userAgent":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","protocolVersion":"Google Protocol Buffer","exchange":"adx","provider":"Google","location":{"dma":-1,"metro":-1,"timezoneOffsetMinutes":-1},"segments":{"AdxDetectedVerticals":["23:0.571243","355:0.409098","380:0.339647","474:0.415936","540:0.079871"]},"userIds":{"prov":"bGZqaWdsa2dmbmRubWxtamdoaGs","xchg":"bGZqaWdsa2dmbmRubWxtamdoaGs"},"imp":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"spots":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"site":{"id":"502","publisher":{"id":"502"},"page":"http://www.blog.343.com/"},"device":{"ua":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","ip":"109.103.101.0","geo":{"zip":" 2600"},"language":"en","ext":{"geo_criteria_id":1000142}},"user":{"id":"bGZqaWdsa2dmbmRubWxtamdoaGs","ext":{"cookie_age_seconds":2080089}}},"bidResponseCreative":{"itemId":"fsknxfwe34235235","campaignId":"342342342","htmlSnippet":"\u003ciframe frameborder=0 scrolling=no width=\"300\" height=\"250\" src=\"//sv.brand-display.com/adedge/api/bd/serving/simple/frame?aukey=xxx\u0026_=%CACHEBUSTER%\u0026winning_price=%WINNING_PRICE%\u0026google_click_url=%CLICK_URL_ESC%\u0026encrypt_value=%ENCRYPT_VALUE%\"\u003e\u003c/iframe\u003e","name":"Expandable Web","formatCode":"billboard","bd":"","status":1,"deleted":false,"destinationUrl":"https://ex.sg","tagging":[1,2,3],"expandingDirection":14,"bdUrl":"","format":{"code":"billboard","name":"Billboard","publisher":"default","type":"standard","width":120,"height":600,"expanded_width":0,"expanded_height":0,"collapsed_width":0,"collapsed_height":0,"aspratio":3.76,"expandable":true,"expand_first":true}},"bidResponseCreativeId":"1","bidResponseCreativeName":"","bidResponseData":{"bids":[{"creative":0,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":0},{"creative":1,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":1}]},"bidResponseMeta":"null","bidWinMeta":"null","biddingAgentName":"iface.http","biddingFullAccount":"829838828928932:strategy","biddingMainAccount":"3434","biddingMaxPrice":"6071USD/1M","biddingRequestFormatType":"datacratic","biddingSubAccount":"strategy","impIndex":"1","impressionId":"199","pricePriority":"1.000000","rawWinPrice":"100USD/1M","timestamp":"2015-Jul-29 08:31:00.53899","userIds":{"prov":"cc","xchg":"cc"},"winPrice":"100USD/1M"}

如何从中获取数据?

1 个答案:

答案 0 :(得分:2)

以下是使用您的数据和本地文件系统而不是s3的一些示例。我认为您的文件没有任何扩展名,但如果不是这样,如果您的文件具有.json扩展名,您可以像这样查询它们(使用表别名来解决歧义):

use dfs.`root`;
+-------+---------------------------------------+
|  ok   |                summary                |
+-------+---------------------------------------+
| true  | Default schema changed to [dfs.root]  |
+-------+---------------------------------------+
1 row selected (0.079 seconds)
select t.auctionID from dfs.`/Users/khahn/drill/apache-drill-1.1.0/part-00000` t;
+------------+
| auctionId  |
+------------+
| xx         |
| 343        |
+------------+
2 rows selected (0.093 seconds)

如果您的文件没有扩展名,请修改存储插件配置以将defaultInputFormat设置为“json”;否则,Drill假定文件默认为Parquet格式。例如,我在root工作区中设置了defaultInputFormat:

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": "json"
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
. . .

当您查询没有扩展名的文件时,您需要在此示例中指定您在其中定义defaultInputFormat,root的工作空间:

select t.auctionID from dfs.`root`.`/Users/khahn/drill/apache-drill-1.1.0/part-00000` t;
+------------+
| auctionId  |
+------------+
| xx         |
| 343        |
+------------+
2 rows selected (0.095 seconds)

请参阅有关处理类型差异的文档:https://drill.apache.org/docs/json-data-model/#handling-type-differences