Pig's AvroStorage LOAD removes unicode characters from input

Asked: 2017-01-26 12:08:15

Tags: unicode apache-pig avro

I am using Pig to read Avro files and normalize/transform the data before writing it back out. The Avro files have records of the following form:

{
  "type" : "record",
  "name" : "KeyValuePair",
  "namespace" : "org.apache.avro.mapreduce",
  "doc" : "A key/value pair",
  "fields" : [ {
    "name" : "key",
    "type" : "string",
    "doc" : "The key"
  }, {
    "name" : "value",
    "type" : {
      "type" : "map",
      "values" : "bytes"
    },
    "doc" : "The value"
  } ]
}

I used the Avro Tools command-line utility together with jq to dump the first matching record to JSON:

    $ java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | ./jq --compact-output 'select(.value.pf_v != null)' | head -n 1 | ./jq .
{
  "key": "some-record-uuid",
  "value": {
    "pf_v": "v1\u0003Basic\u0001slcvdr1rw\u001a\u0004v2\u0003DayWatch\u0001slcva2omi\u001a\u0004v3\u0003Performance\u0001slc1vs1v1w1p1g1i\u0004v4\u0003Fundamentals\u0001snlj1erwi\u001a\u0004v5\u0003My Portfolio\u0001svr1dews1b2b3k1k2\u001a\u0004v0\u00035"
  }
}

I run the following Pig commands:

REGISTER avro-1.8.1.jar
REGISTER json-simple-1.1.1.jar
REGISTER piggybank-0.15.0.jar
REGISTER jackson-core-2.8.6.jar
REGISTER jackson-databind-2.8.6.jar

DEFINE AvroLoader org.apache.pig.piggybank.storage.avro.AvroStorage();
AllRecords = LOAD 'part-m-00000.avro'
    USING AvroLoader()
    AS (key: chararray, value: map[]);

Records = FILTER AllRecords BY value#'pf_v' is not null;

SmallRecords = LIMIT Records 10;
DUMP SmallRecords;

The corresponding record in the output of the last command above looks like this:

...
(some-record-uuid,[pf_v#v03v1Basicslcviv2DayWatchslcva2omiv3Performanceslc1vs1v1w1p1g1i])
...
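Since characters such as U+0001, U+0003, U+0004 and U+001A are non-printing control characters, a terminal may simply not render them. A minimal sketch of one way to make them visible, assuming a line of DUMP output has been captured into a Python string (the `visible` helper and the sample `line` are hypothetical, not part of Pig or Avro):

```python
def visible(text):
    """Return text with control characters (< U+0020) shown as \\xNN escapes."""
    return "".join(ch if ord(ch) >= 0x20 else "\\x%02x" % ord(ch) for ch in text)


# Hypothetical string standing in for one captured line of DUMP output.
line = "pf_v#v0\x033\x04v1\x03Basic"
shown = visible(line)
print(shown)
```

If the escapes show up in the result, the characters are still in the data and only the display hides them.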

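As you can see, the unicode characters (control characters such as \u0001, \u0003, \u0004 and \u001a) have been removed from the pf_v value. These characters are actually used as delimiters within the values, so I need them in order to fully parse the records into the desired normalized state. They are clearly present in the encoded .avro file (as demonstrated by dumping the file to JSON). Does anyone know a way to stop AvroStorage from removing the unicode characters when loading the records?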
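To illustrate why the delimiters matter, here is a minimal sketch of how such a value might be split once the delimiters are intact. The delimiter roles are assumptions inferred from the sample value, not a documented format: \x04 is taken to separate entries, \x03 to separate an entry's key from its payload, and \x01 to separate sub-fields. The `parse_pf_v` helper is hypothetical.

```python
# Assumed delimiter roles, guessed from the sample pf_v value.
ENTRY_SEP = "\x04"  # assumed: separates entries such as v1, v2, ...
KEY_SEP = "\x03"    # assumed: separates an entry's key from its payload
FIELD_SEP = "\x01"  # assumed: separates sub-fields inside a payload


def parse_pf_v(raw):
    """Split a pf_v string into {key: [sub-fields]} using the assumed delimiters."""
    parsed = {}
    for entry in raw.split(ENTRY_SEP):
        if not entry:
            continue  # skip empty pieces, e.g. after a trailing separator
        key, _, payload = entry.partition(KEY_SEP)
        parsed[key] = payload.split(FIELD_SEP)
    return parsed


# Shortened copy of the sample value from the JSON dump.
sample = "v1\x03Basic\x01slcvdr1rw\x1a\x04v2\x03DayWatch\x01slcva2omi\x1a\x04v0\x035"
result = parse_pf_v(sample)
print(result)
```

The trailing \x1a on some sub-fields is left in place here, since its meaning is not clear from the sample. Without the delimiters, none of these splits are possible.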

Thanks!

Update: I also performed the same operation using Avro's Python DataFileReader:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-m-00000.avro", "rb"), DatumReader())

for rec in reader:
    if 'some-record-uuid' in rec['key']:
        print rec
        print '--------------------------------------------'
        break

reader.close()

This prints a dict in which the unicode characters appear to have been replaced by hex escapes (which is preferable to removing them entirely):

{u'value': {u'pf_v': 'v0\x033\x04v1\x03Basic\x01slcvi\x1a\x04v2\x03DayWatch\x01slcva2omi\x1a\x04v3\x03Performance\x01slc1vs1v1w1p1g1i\x1a'}, u'key': u'some-record-uuid'}
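For what it is worth, the \xNN sequences in this output are how Python's repr displays non-printable characters; they denote the same code points that the JSON dump writes as \u00NN. A small check of that equivalence, using a shortened copy of the sample value (the variable names are illustrative only):

```python
import json

# The JSON dump writes the control characters as \uXXXX escapes...
from_json = json.loads('"v1\\u0003Basic\\u0001slcvdr1rw\\u001a"')

# ...while Python's repr shows the same characters as \xXX escapes.
from_python = "v1\x03Basic\x01slcvdr1rw\x1a"

same = from_json == from_python
# Code points of the control characters found in the value.
code_points = [hex(ord(ch)) for ch in from_python if ord(ch) < 0x20]
print(same, code_points)
```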

0 answers:

No answers yet