Apache Drill: handling cp1252 character codes

Time: 2017-04-12 14:11:22

Tags: encoding apache-drill

The data we query from a CSV file contains cp1252 character codes, and Apache Drill fails with the following error:

    org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: MalformedInputException: Input length = 1

    Fragment 0:0

    [Error Id: 53bc07e3-a6e4-4301-a858-205be382275e on 172.16.243.116:31010]

      (java.lang.RuntimeException) java.nio.charset.MalformedInputException: Input length = 1
        org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.decodeUT8():185
        org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.setBuffer():119
        org.apache.drill.exec.test.generated.FiltererGen174.doEval():50
        org.apache.drill.exec.test.generated.FiltererGen174.filterBatchNoSV():100
        org.apache.drill.exec.test.generated.FiltererGen174.filterBatch():73
        org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.doWork():81
        org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
        org.apache.drill.exec.record.AbstractRecordBatch.next():162
        org.apache.drill.exec.record.AbstractRecordBatch.next():119
        org.apache.drill.exec.record.AbstractRecordBatch.next():109
        org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
        org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
        org.apache.drill.exec.record.AbstractRecordBatch.next():162
        org.apache.drill.exec.record.AbstractRecordBatch.next():119
        org.apache.drill.exec.record.AbstractRecordBatch.next():109
        org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
        org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
        org.apache.drill.exec.record.AbstractRecordBatch.next():162
        org.apache.drill.exec.record.AbstractRecordBatch.next():119
        org.apache.drill.exec.record.AbstractRecordBatch.next():109
        org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
        org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
        org.apache.drill.exec.record.AbstractRecordBatch.next():162
        org.apache.drill.exec.physical.impl.BaseRootExec.next():104
        org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
        org.apache.drill.exec.physical.impl.BaseRootExec.next():94
        org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
        org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
        java.security.AccessController.doPrivileged():-2
        javax.security.auth.Subject.doAs():422
        org.apache.hadoop.security.UserGroupInformation.doAs():1657
        org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
        org.apache.drill.common.SelfCleaningRunnable.run():38
        java.util.concurrent.ThreadPoolExecutor.runWorker():1142
        java.util.concurrent.ThreadPoolExecutor$Worker.run():617
        java.lang.Thread.run():745
      Caused By (java.nio.charset.MalformedInputException) Input length = 1
        java.nio.charset.CoderResult.throwException():281
        org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.decodeUT8():183
        org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.setBuffer():119
        org.apache.drill.exec.test.generated.FiltererGen174.doEval():50
        org.apache.drill.exec.test.generated.FiltererGen174.filterBatchNoSV():100
        org.apache.drill.exec.test.generated.FiltererGen174.filterBatch():73
        org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.doWork():81
        org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
        org.apache.drill.exec.record.AbstractRecordBatch.next():162
        org.apache.drill.exec.record.AbstractRecordBatch.next():119
        org.apache.drill.exec.record.AbstractRecordBatch.next():109
        org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
        org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
        org.apache.drill.exec.record.AbstractRecordBatch.next():162
        org.apache.drill.exec.record.AbstractRecordBatch.next():119
        org.apache.drill.exec.record.AbstractRecordBatch.next():109
        org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
        org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
        org.apache.drill.exec.record.AbstractRecordBatch.next():162
        org.apache.drill.exec.record.AbstractRecordBatch.next():119
        org.apache.drill.exec.record.AbstractRecordBatch.next():109
        org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
        org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
        org.apache.drill.exec.record.AbstractRecordBatch.next():162
        org.apache.drill.exec.physical.impl.BaseRootExec.next():104
        org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
        org.apache.drill.exec.physical.impl.BaseRootExec.next():94
        org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
        org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
        java.security.AccessController.doPrivileged():-2
        javax.security.auth.Subject.doAs():422
        org.apache.hadoop.security.UserGroupInformation.doAs():1657
        org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
        org.apache.drill.common.SelfCleaningRunnable.run():38
        java.util.concurrent.ThreadPoolExecutor.runWorker():1142
        java.util.concurrent.ThreadPoolExecutor$Worker.run():617
        java.lang.Thread.run():745

Is there a way to handle this kind of data in Apache Drill?

1 answer:

Answer 0 (score: 0)

@OP I know this is an old post, but I ran into the same problem last week with a new data feed.

Directly in Apache Drill (MapR distribution), I converted the cp1252 characters with STRING_BINARY(). It is neither an elegant nor an efficient solution, but it works.
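To make the idea concrete outside Drill, here is a small Python sketch of why the byte fails and what the escape-then-strip approach does to it. It is only an imitation under stated assumptions: `escape_non_printable` is a hypothetical helper that mimics the `\xNN` escaping visible in the STRING_BINARY() output, not Drill's actual implementation, and the sample value is shortened from the data in this thread.

```python
import re

# Byte 0xA0 is a non-breaking space in cp1252; on its own it is not valid UTF-8.
raw = b"SeattleTheatreGroup\xa0_201901"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # Same class of failure as the MalformedInputException in Drill.
    print("UTF-8 decode failed at byte offset", exc.start)

# Decoding with the right charset works and yields U+00A0.
print(repr(raw.decode("cp1252")))


def escape_non_printable(data: bytes) -> str:
    """Hypothetical stand-in for the escaping seen in STRING_BINARY output:
    printable ASCII is kept, every other byte becomes a literal \\xNN."""
    return "".join(
        chr(b) if 0x20 <= b <= 0x7E else "\\x%02X" % b
        for b in data
    )


escaped = escape_non_printable(raw)      # the string now contains a literal \xA0
cleaned = re.sub(r"\\xA0", "", escaped)  # drop that escape sequence
print(escaped)
print(cleaned)
```

Note that this removes the character rather than converting it, so it only suits cases where the stray bytes carry no meaning, as with the non-breaking space here.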

    apache drill 1.10.0
    "drill baby drill"
    0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> use sys;
    +-------+----------------------------------+
    |  ok   |             summary              |
    +-------+----------------------------------+
    | true  | Default schema changed to [sys]  |
    +-------+----------------------------------+
    1 row selected (0.975 seconds)
    0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select version from version;
    +----------+
    | version  |
    +----------+
    | 1.10.0   |
    +----------+
    1 row selected (0.409 seconds)
    0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
    0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select * from users.`sbalas002c`.drill_spl_char;
    +------------------------+--------------------------------------------------------------+
    | ORIG_CAMPAIGN_LINE_ID  |                   ORIG_CAMPAIGN_LINE_NAME                    |
    +------------------------+--------------------------------------------------------------+
    | 30092278               | 1573256-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_SSEA  |
    | 30092282               | 1573257-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_WORD  |
    | 30092286               | 1573254-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_BLIS  |
    | 30092290               | 1573255-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_NSEA  |
    +------------------------+--------------------------------------------------------------+
    4 rows selected (0.445 seconds)
    0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
    0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select ORIG_CAMPAIGN_LINE_NAME,
    . . . . . . . . . . . . . . . . . . . . . . .> substr(ORIG_CAMPAIGN_LINE_NAME,1,4) sub_CAMPAIGN_LINE_NAME
    . . . . . . . . . . . . . . . . . . . . . . .> from users.`sbalas002c`.drill_spl_char;
    Error: SYSTEM ERROR: DrillRuntimeException: Unexpected byte 0xa0 at position 36 encountered while decoding UTF8 string.

    Fragment 0:0

    [Error Id: 1889163a-f847-48ad-a7a9-bbe4284e112c on titand-ch2-p20.cable.comcast.com:31010] (state=,code=0)
    0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
    0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select ORIG_CAMPAIGN_LINE_NAME,
    . . . . . . . . . . . . . . . . . . . . . . .> STRING_BINARY(ORIG_CAMPAIGN_LINE_NAME) SB_CAMPAIGN_LINE_NAME,
    . . . . . . . . . . . . . . . . . . . . . . .> regexp_replace(STRING_BINARY(ORIG_CAMPAIGN_LINE_NAME),'\\xA0','') Good_CAMPAIGN_LINE_NAME
    . . . . . . . . . . . . . . . . . . . . . . .> from users.`sbalas002c`.drill_spl_char;
    +--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
    |                   ORIG_CAMPAIGN_LINE_NAME                    |                      SB_CAMPAIGN_LINE_NAME                      |                   Good_CAMPAIGN_LINE_NAME                   |
    +--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
    | 1573256-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_SSEA  | 1573256-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_SSEA  | 1573256-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_SSEA  |
    | 1573257-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_WORD  | 1573257-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_WORD  | 1573257-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_WORD  |
    | 1573254-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_BLIS  | 1573254-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_BLIS  | 1573254-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_BLIS  |
    | 1573255-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_NSEA  | 1573255-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_NSEA  | 1573255-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_NSEA  |
    +--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
    4 rows selected (0.64 seconds)
    0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>

Hope this helps others.
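If the source file can be touched before Drill reads it, a different route (not part of the answer above, just a sketch) is to re-encode the CSV from cp1252 to UTF-8 once, so that ordinary string functions such as substr() no longer hit invalid bytes. The function name and file names below are hypothetical, and the sketch assumes the whole file really is cp1252.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, source_encoding: str = "cp1252") -> None:
    """Rewrite src as UTF-8, decoding it with source_encoding.

    newline="" keeps the original line endings untouched.
    """
    with src.open("r", encoding=source_encoding, newline="") as fin, \
         dst.open("w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)


# Small self-contained demo with a temporary file (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "campaign_cp1252.csv"
    dst = Path(tmp) / "campaign_utf8.csv"
    src.write_bytes(b"30092278,SeattleTheatreGroup\xa0_201901\n")

    reencode_to_utf8(src, dst)
    converted = dst.read_bytes()

# The single cp1252 byte 0xA0 is now the two-byte UTF-8 sequence C2 A0.
print(converted)
```

Python's cp1252 codec rejects the few byte values that cp1252 leaves undefined (for example 0x81), so a file containing them would need `errors="replace"` or a similar policy when opening the source.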