How to read a .seq file from S3 in Spark

Time: 2016-01-15 22:09:50

Tags: apache-spark spark-streaming

I am trying to read a .seq file from S3. When I try to read it, the output is:

    SEQorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable'org.apache.hadoop.io.compress.GzipCodecp

followed by a bunch of encoded characters. What format is this, and how do I decode the file? This is my first time with Hadoop, so please be generous :)

Update: I tried

    sc.sequenceFile[Text,BytesWritable]("s3n://logs/box316_0.seq").take(5).foreach(println)

So the data is a JSON blob stored in a sequence file, and it gives me:

    Serialization stack:
        - object not serializable (class: org.apache.hadoop.io.Text, value: 5)
        - field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
        - object (class scala.Tuple2, (5,7g 22 73 69 6d 65 43 74 71 9d 90 92 3a ..................

3 answers:

Answer 0 (score: 4)

Try:

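import org.apache.hadoop.io.{BytesWritable, Text}
val path = "s3n://logs/box316_0.seq"
// The file header names Text as the key class, so declare Text here.
val seq = sc.sequenceFile[Text, BytesWritable](path)
// getBytes returns the padded backing array; decode only the first getLength bytes.
val usableRDD = seq.map { case (_, v) => Text.decode(v.getBytes, 0, v.getLength) }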

Answer 1 (score: 0)

With sequence files, you have to know the key and value types. Yours appear to be Text and BytesWritable. Try this:

sc.sequenceFile[Text,BytesWritable]("s3n://logs/box316_0.seq").take(5).foreach(println) 
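
The update in the question reports `object not serializable (class: org.apache.hadoop.io.Text ...)` for this same call. A likely cause is that `take` has to ship the Writable objects back to the driver, and Hadoop Writables do not implement `java.io.Serializable`. A minimal sketch of one way around it is to convert each pair to plain strings on the executors first. It assumes the keys and values are UTF-8 text (the question says the values are JSON); the names `SeqFileReader`, `decodeValid`, and `readAsStrings` are hypothetical and not part of Spark or Hadoop:

```scala
import java.nio.charset.StandardCharsets

import org.apache.hadoop.io.{BytesWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SeqFileReader {
  // Decodes only the first `length` bytes; the rest of a BytesWritable's
  // backing array may be padding or leftovers from a previous record.
  def decodeValid(bytes: Array[Byte], length: Int): String =
    new String(bytes, 0, length, StandardCharsets.UTF_8)

  // Hypothetical helper: returns plain (String, String) pairs, which Spark
  // can serialize and send to the driver, unlike Text and BytesWritable.
  def readAsStrings(sc: SparkContext, path: String): RDD[(String, String)] =
    sc.sequenceFile[Text, BytesWritable](path).map { case (k, v) =>
      // Copy the contents out right away, since Hadoop reuses Writable instances.
      (k.toString, decodeValid(v.getBytes, v.getLength))
    }
}
```

In spark-shell, where `sc` is predefined, `SeqFileReader.readAsStrings(sc, "s3n://logs/box316_0.seq").take(5).foreach(println)` should then print string pairs instead of failing during serialization.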

Answer 2 (score: 0)

We ran into this problem often enough that we built a solution around it, which we call readSEQ. It lets you read sequence files into Parquet, AVRO, or JSON.

http://www.intricity.com/readseq/