Pig filter fails due to unexpected data

Asked: 2015-07-23 17:14:24

Tags: hadoop cassandra apache-pig

I'm running Cassandra with about 20k records in it. I'm trying to run a filter over this data, but I get the following message:


2015-07-23 13:02:23,559 [Thread-4] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: com.datastax.driver.core.exceptions.InvalidQueryException: expected 8 or 0 byte long (1)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initNextRecordReader(PigRecordReader.java:260)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:205)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: com.datastax.driver.core.exceptions.InvalidQueryException: expected 8 or 0 byte long (1)
        at com.datastax.driver.core.exceptions.InvalidQueryException.copy(InvalidQueryException.java:35)
        at com.datastax.driver.core.DefaultResultSetFuture.extractCauseFromExecutionException(DefaultResultSetFuture.java:263)
        at com.datastax.driver.core.DefaultResultSetFuture.getUninterruptibly(DefaultResultSetFuture.java:179)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:52)
        at com.datastax.driver.core.AbstractSession.execute(AbstractSession.java:44)
        at org.apache.cassandra.hadoop.cql3.CqlRecordReader$RowIterator.<init>(CqlRecordReader.java:259)
        at org.apache.cassandra.hadoop.cql3.CqlRecordReader.initialize(CqlRecordReader.java:151)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initNextRecordReader(PigRecordReader.java:256)
        ... 7 more

You would think this is an obvious error, and believe me, Google has plenty of results for it. It's clear that some of my data does not conform to the type expected for a given column. What I don't understand is 1.) why this is happening, and 2.) how to debug it. If I try to insert invalid data into Cassandra from my nodejs application, it throws this kind of error whenever my data type doesn't match the column's data type, which means this shouldn't even be possible? I've read that data validation with UTF8 is unreliable and that setting a different type of validation is the answer, but I don't know how to do that. Here are my steps to reproduce:
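To make the error message itself less mysterious: a Cassandra long/timestamp column is serialized as exactly 8 bytes (or 0 bytes for a null cell), and "expected 8 or 0 byte long (1)" means a 1-byte value turned up where an 8-byte long was required. The sketch below is a hypothetical illustration of that length check, not Cassandra's actual validator code:

```python
# Hypothetical sketch of why Cassandra reports "expected 8 or 0 byte long (1)":
# a LongType cell must be exactly 8 bytes (a value) or 0 bytes (null).
import struct

def validate_long(cell: bytes):
    """Accept an 8-byte big-endian long or an empty (null) cell; reject anything else."""
    if len(cell) == 0:
        return None
    if len(cell) != 8:
        raise ValueError(f"expected 8 or 0 byte long ({len(cell)})")
    return struct.unpack(">q", cell)[0]

# A well-formed timestamp-in-millis round-trips fine:
ok = validate_long(struct.pack(">q", 1437678144000))
print(ok)

# A 1-byte cell reproduces the shape of the error in the stack trace:
try:
    validate_long(b"\x01")
except ValueError as e:
    print(e)
```

The key takeaway: the bytes on disk were fine; it is the reader's idea of the row layout that can be wrong, which is exactly what a partitioner mismatch (see the answer below) causes.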

grunt> define CqlNativeStorage org.apache.cassandra.hadoop.pig.CqlNativeStorage();
grunt> test = load 'cql://blah/blahblah' USING CqlNativeStorage();
grunt> describe test;
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - Found ksDef name: blah
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - partition keys: ["ad_id"]
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - cluster keys: []
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - row key validator: org.apache.cassandra.db.marshal.UTF8Type
13:09:54.544 [main] DEBUG o.a.c.hadoop.pig.CqlNativeStorage - cluster key validator: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type)
blahblah: {ad_id: chararray, address: chararray, city: chararray, date_created: long, date_listed: long, fireplace: bytearray, furnished: bytearray, garage: bytearray, neighbourhood: chararray, num_bathrooms: int, num_bedrooms: int, pet_friendly: bytearray, postal_code: chararray, price: double, province: chararray, square_feet: int, url: chararray, utilities_included: bytearray}
grunt> query1 = FILTER blahblah BY city == 'New York';
grunt> dump query1;

It then runs for a while, dumps a lot of logs, and shows the error above.

1 Answer:

Answer 0 (score: 1)

Found my problem: Pig's partitioner did not match CQL3, so the data was being parsed incorrectly. Previously the environment variable was PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner. After I changed it to PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner, it started working.
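A minimal sketch of applying this fix, assuming a local node where cqlsh is available: you can first confirm the cluster's actual partitioner with a standard query against system.local, then export the matching value before launching grunt.

```shell
# Confirm the cluster's real partitioner (standard CQL, run it yourself):
#   cqlsh -e "SELECT partitioner FROM system.local;"
# If it reports Murmur3Partitioner, make Pig agree before starting grunt:
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
echo "$PIG_PARTITIONER"
```

With a mismatched partitioner, the Hadoop input splits point at the wrong token ranges, so the reader decodes bytes against the wrong columns, producing exactly the "expected 8 or 0 byte long" validation failure above.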