The data we are querying from a CSV file contains cp1252 character codes, and Apache Drill fails with the following error:
org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: MalformedInputException: Input length = 1
Fragment 0:0
[Error Id: 53bc07e3-a6e4-4301-a858-205be382275e on 172.16.243.116:31010]
  (java.lang.RuntimeException) java.nio.charset.MalformedInputException: Input length = 1
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.decodeUT8():185
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.setBuffer():119
    org.apache.drill.exec.test.generated.FiltererGen174.doEval():50
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatchNoSV():100
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatch():73
    org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.doWork():81
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745
  Caused By (java.nio.charset.MalformedInputException) Input length = 1
    java.nio.charset.CoderResult.throwException():281
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.decodeUT8():183
    org.apache.drill.exec.expr.fn.impl.CharSequenceWrapper.setBuffer():119
    org.apache.drill.exec.test.generated.FiltererGen174.doEval():50
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatchNoSV():100
    org.apache.drill.exec.test.generated.FiltererGen174.filterBatch():73
    org.apache.drill.exec.physical.impl.filter.FilterRecordBatch.doWork():81
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():115
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.svremover.RemovingRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():109
    org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():135
    org.apache.drill.exec.record.AbstractRecordBatch.next():162
    org.apache.drill.exec.physical.impl.BaseRootExec.next():104
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():94
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():232
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():226
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1657
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():226
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1142
    java.util.concurrent.ThreadPoolExecutor$Worker.run():617
    java.lang.Thread.run():745
Is there a way to handle this kind of data in Apache Drill?
Answer 0 (score: 0)
@OP
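I know this is an old post, but I ran into this same challenge last week with a new data feed.
Directly in Apache Drill (MapR distribution), I used STRING_BINARY() to convert the cp1252 characters.
It is not an elegant or efficient solution, but it works.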
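To make the workaround concrete: judging from the session output below, STRING_BINARY() renders the raw bytes as text and writes any byte it cannot print as a \xNN escape, after which regexp_replace() can strip that escape. The Python sketch below only approximates this behaviour for illustration. The helper name string_binary and the rule for which bytes count as printable are assumptions, not Drill's actual implementation.

```python
import re


def string_binary(raw: bytes) -> str:
    """Approximate the STRING_BINARY() output seen in the session (assumed rule, not Drill's code)."""
    out = []
    for b in raw:
        # Assumption: printable ASCII other than the backslash is kept as-is;
        # every other byte is written as a \xNN escape with uppercase hex digits.
        if 0x20 <= b <= 0x7E and b != 0x5C:
            out.append(chr(b))
        else:
            out.append("\\x%02X" % b)
    return "".join(out)


# First row of the sample table, encoded as cp1252 with a non-breaking space (0xA0) after "Group".
raw = "1573256-1_306774_SeattleTheatreGroup\u00a0_201901_ISV_SEA_Z_SSEA".encode("cp1252")

escaped = string_binary(raw)             # the 0xA0 byte becomes the four characters \xA0
cleaned = re.sub(r"\\xA0", "", escaped)  # same pattern as in the regexp_replace() call

print(escaped)
print(cleaned)
```

Note that this approach drops the character rather than converting it, which is fine for a stray non-breaking space but would lose information for accented letters.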
apache drill 1.10.0
"drill baby drill"
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> use sys;
+-------+----------------------------------+
|  ok   |             summary              |
+-------+----------------------------------+
| true  | Default schema changed to [sys]  |
+-------+----------------------------------+
1 row selected (0.975 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select version from version;
+----------+
| version  |
+----------+
| 1.10.0   |
+----------+
1 row selected (0.409 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select * from users.`sbalas002c`.drill_spl_char;
+------------------------+--------------------------------------------------------------+
| ORIG_CAMPAIGN_LINE_ID  |                   ORIG_CAMPAIGN_LINE_NAME                    |
+------------------------+--------------------------------------------------------------+
| 30092278               | 1573256-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_SSEA  |
| 30092282               | 1573257-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_WORD  |
| 30092286               | 1573254-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_BLIS  |
| 30092290               | 1573255-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_NSEA  |
+------------------------+--------------------------------------------------------------+
4 rows selected (0.445 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select ORIG_CAMPAIGN_LINE_NAME,
. . . . . . . . . . . . . . . . . . . . . . .> substr(ORIG_CAMPAIGN_LINE_NAME,1,4) sub_CAMPAIGN_LINE_NAME
. . . . . . . . . . . . . . . . . . . . . . .> from users.`sbalas002c`.drill_spl_char;
Error: SYSTEM ERROR: DrillRuntimeException: Unexpected byte 0xa0 at position 36 encountered while decoding UTF8 string.
Fragment 0:0
[Error Id: 1889163a-f847-48ad-a7a9-bbe4284e112c on titand-ch2-p20.cable.comcast.com:31010] (state=,code=0)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT> select ORIG_CAMPAIGN_LINE_NAME,
. . . . . . . . . . . . . . . . . . . . . . .> STRING_BINARY(ORIG_CAMPAIGN_LINE_NAME) SB_CAMPAIGN_LINE_NAME,
. . . . . . . . . . . . . . . . . . . . . . .> regexp_replace(STRING_BINARY(ORIG_CAMPAIGN_LINE_NAME),'\\xA0','') Good_CAMPAIGN_LINE_NAME
. . . . . . . . . . . . . . . . . . . . . . .> from users.`sbalas002c`.drill_spl_char;
+--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
|                   ORIG_CAMPAIGN_LINE_NAME                    |                      SB_CAMPAIGN_LINE_NAME                      |                   Good_CAMPAIGN_LINE_NAME                   |
+--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
| 1573256-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_SSEA  | 1573256-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_SSEA  | 1573256-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_SSEA  |
| 1573257-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_WORD  | 1573257-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_WORD  | 1573257-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_WORD  |
| 1573254-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_BLIS  | 1573254-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_BLIS  | 1573254-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_BLIS  |
| 1573255-1_306774_SeattleTheatreGroup�_201901_ISV_SEA_Z_NSEA  | 1573255-1_306774_SeattleTheatreGroup\xA0_201901_ISV_SEA_Z_NSEA  | 1573255-1_306774_SeattleTheatreGroup_201901_ISV_SEA_Z_NSEA  |
+--------------------------------------------------------------+-----------------------------------------------------------------+-------------------------------------------------------------+
4 rows selected (0.64 seconds)
0: jdbc:drill:zk=titana-ch2-p3:5181/drill/TIT>
Hope this helps others.
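As for why Drill rejects the value in the first place: in cp1252 the single byte 0xA0 is a non-breaking space, but in UTF-8 that byte is only valid as a continuation byte and cannot stand on its own. A minimal Python sketch of the same decoding failure follows. The byte string is reconstructed from the first row of the session above, assuming the odd character is exactly the one byte 0xA0 that the error message reports.

```python
# Bytes of the first row, assuming the odd character is the single byte 0xA0.
raw = b"1573256-1_306774_SeattleTheatreGroup\xa0_201901_ISV_SEA_Z_SSEA"

# Strict UTF-8 decoding fails, because 0xA0 cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    bad_offset = None
except UnicodeDecodeError as exc:
    bad_offset = exc.start  # zero-based index of the offending byte

# The same bytes decode cleanly as cp1252, where 0xA0 is a non-breaking space.
text = raw.decode("cp1252")

print(bad_offset)
print(repr(text[bad_offset]))
# Re-encoded as UTF-8, the character takes two bytes instead of one.
print(text.encode("utf-8")[bad_offset:bad_offset + 2])
```

The offset Python reports for the bad byte is the same position 36 that Drill names in its error message.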