Question

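I store files of many different types on Azure Blob Storage: txt, doc, pdf, and so on. However, every file is stored with the content type 'octet-stream', and when I open a file to extract its text with Tika, Tika cannot detect the character encoding. How can I fix this?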

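import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Open the blob through the Hadoop FileSystem API
FileSystem fs = FileSystem.get(new Configuration());
Path pt = new Path(Configs.BLOBSTORAGEPREFIX + fileAdd);
InputStream stream = fs.open(pt);

// Let Tika detect the file type and extract the text
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();

parser.parse(stream, handler, metadata);

// Collect the extracted text
spaceContentBuffer.append(handler.toString());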

Answer 1

If you call the Azure Storage REST API directly, you can set the "x-ms-blob-content-type" header through the Set Blob Properties API.
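For illustration, here is a rough Java sketch of what such a request could look like. It only builds the request and does not send it. The class name `SetBlobContentType` and the parameter `blobUrlWithSas` are made up for this example, and it assumes the blob URL already carries a SAS token in its query string for authorization; other authorization schemes need additional headers that are not shown here.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class SetBlobContentType {
    // Builds (but does not send) a Set Blob Properties request.
    // blobUrlWithSas is a hypothetical blob URL; it is assumed to carry a SAS token.
    static HttpRequest build(String blobUrlWithSas, String contentType) {
        // Set Blob Properties is selected with the comp=properties query parameter.
        String separator = blobUrlWithSas.contains("?") ? "&" : "?";
        URI uri = URI.create(blobUrlWithSas + separator + "comp=properties");
        return HttpRequest.newBuilder(uri)
                .header("x-ms-blob-content-type", contentType)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }
}
```

The returned request can then be passed to `HttpClient.send(...)`. Check the Set Blob Properties documentation first for how properties that are left out of the request are treated.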

If you are using the Azure Storage Client Library, you can write code like this:

blockBlob.Properties.ContentType = "text/xml";
blockBlob.SetProperties();
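Since the question is in Java and covers several file types, one possible way to choose the value to set is to derive it from the blob name. The sketch below is only an illustration: the class `ContentTypeGuesser` and its small extension table are assumptions for this example, not part of any Azure or Tika API, and the table would need to be extended for other formats.

```java
import java.util.Locale;
import java.util.Map;

final class ContentTypeGuesser {
    // Hypothetical lookup table covering the file types mentioned in the question.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "xml", "text/xml",
            "pdf", "application/pdf",
            "doc", "application/msword");

    // Returns a content type for the blob name, or the generic
    // application/octet-stream when the extension is missing or unknown.
    static String guess(String blobName) {
        int dot = blobName.lastIndexOf('.');
        if (dot < 0 || dot == blobName.length() - 1) {
            return "application/octet-stream";
        }
        String ext = blobName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
    }
}
```

For example, `ContentTypeGuesser.guess("report.PDF")` yields `application/pdf`, which could then be assigned as the blob's content type.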
