I am trying to fetch a .seq file from S3. When I try to read it, the output is -
SEQorg.apache.hadoop.io.Text"org.apache.hadoop.io.BytesWritable'org.apache.hadoop.io.compress.GzipCodecp
followed by a bunch of encoded characters. What format is this, and how do I decode the file? This is my first time with Hadoop, so please be generous :)
Update: I tried
sc.sequenceFile[Text,BytesWritable]("s3n://logs/box316_0.seq").take(5).foreach(println)
and it gives me -
Serialization stack: - object not serializable
(class: org.apache.hadoop.io.Text, value: 5) -
field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
- object (class scala.Tuple2, (5,7g 22 73 69 6d 65 43 74 71 9d 90 92 3a ..................
So the data is a JSON blob stored in a sequence file. – user1579557
Answer 0: (score: 4)
Try:
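import org.apache.hadoop.io.{BytesWritable, Text}
val path = "s3n://logs/box316_0.seq"
// The file header quoted in the question lists Text as the key class, so read the keys as Text.
val seq = sc.sequenceFile[Text, BytesWritable](path)
// getBytes returns the padded backing buffer, so decode only the first getLength bytes.
val usableRDD = seq.map { case (_, v) => Text.decode(v.getBytes, 0, v.getLength) }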
Answer 1: (score: 0)
With sequence files you have to know the key and value types. Yours appear to be Text, BytesWritable. Try this:
sc.sequenceFile[Text,BytesWritable]("s3n://logs/box316_0.seq").take(5).foreach(println)
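For anyone wondering where those types come from: they can be read off the start of the file, which is what the "SEQorg.apache.hadoop.io.Text..." output in the question is. Below is a minimal sketch of a header parser in plain Scala, with no Hadoop dependency. It assumes the version 6 SequenceFile header layout and class names shorter than 128 bytes (so each length prefix is a single byte). `SeqHeader`, `Header`, `parse` and `sample` are names made up for this illustration, and the sample bytes are built by hand to mimic the header in the question, not taken from the real file.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper for illustration; not part of Hadoop or Spark.
object SeqHeader {
  final case class Header(version: Int, keyClass: String, valueClass: String,
                          compressed: Boolean, blockCompressed: Boolean,
                          codec: Option[String])

  def parse(bytes: Array[Byte]): Header = {
    require(bytes.length >= 4 &&
      new String(bytes, 0, 3, StandardCharsets.US_ASCII) == "SEQ",
      "not a SequenceFile: missing SEQ magic")
    val version = bytes(3).toInt
    // Assumption: only the version 6 layout is handled in this sketch.
    require(version == 6, s"unsupported SequenceFile version: $version")
    var pos = 4

    // Strings are stored as a length prefix followed by UTF-8 bytes.
    // Assumption: the length fits in one byte (0 to 127).
    def readString(): String = {
      val len = bytes(pos).toInt
      require(len >= 0, "multi-byte length prefixes are not handled here")
      val s = new String(bytes, pos + 1, len, StandardCharsets.UTF_8)
      pos += 1 + len
      s
    }
    def readBoolean(): Boolean = { val b = bytes(pos) != 0; pos += 1; b }

    val keyClass = readString()
    val valueClass = readString()
    val compressed = readBoolean()
    val blockCompressed = readBoolean()
    // The codec class name is only present when compression is on.
    val codec = if (compressed) Some(readString()) else None
    Header(version, keyClass, valueClass, compressed, blockCompressed, codec)
  }

  // Hand-built bytes that mimic the header shown in the question.
  def sample: Array[Byte] = {
    def str(s: String): Array[Byte] = {
      val b = s.getBytes(StandardCharsets.UTF_8)
      b.length.toByte +: b
    }
    "SEQ".getBytes(StandardCharsets.US_ASCII) ++ Array[Byte](6) ++
      str("org.apache.hadoop.io.Text") ++
      str("org.apache.hadoop.io.BytesWritable") ++
      Array[Byte](1, 0) ++ // compressed = true, blockCompressed = false (assumed)
      str("org.apache.hadoop.io.compress.GzipCodec")
  }

  def main(args: Array[String]): Unit =
    println(parse(sample))
}
```

The stray `"` and `'` in the question's output fit this layout: they are the one-byte length prefixes 34 and 39 of the two class names that follow them. The key and value classes from the header are what you pass as type parameters to `sc.sequenceFile`, and since the codec is recorded in the header, the reader should take care of the gzip decompression for you.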
Answer 2: (score: 0)
We ran into this problem often, so we went ahead and built a solution around it. We call it readSEQ. It lets you read sequence files into Parquet, AVRO, or JSON.