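I'm trying to write a Pig eval function (UDF) that uses Apache Tika to extract text from PDF files. However, whenever I run it, the function writes only 0 or 1 bytes to the output. How can I fix my code?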
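import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataByteArray;
import org.apache.pig.data.Tuple;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ExtractTextFromPDFs extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        String pdfText;
        // Nothing to parse if the tuple or its first field is missing
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return "N/A";
        }
        // The first field is expected to hold the raw PDF bytes
        DataByteArray dba = (DataByteArray) input.get(0);
        InputStream is = new ByteArrayInputStream(dba.get());
        ContentHandler contenthandler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        Parser pdfparser = new AutoDetectParser();
        try {
            pdfparser.parse(is, contenthandler, metadata, new ParseContext());
        } catch (SAXException | TikaException e) {
            // Parse failures are only printed here, so the UDF still returns a value
            e.printStackTrace();
        }
        pdfText = contenthandler.toString();
        // close the input stream
        if (is != null) {
            is.close();
        }
        return pdfText;
    }
}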
I call it with 'c = foreach b generate ExtractTextFromPDFs(content);', where b holds the PDFs and content is a bytearray.
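One check that might help narrow this down is whether the bytes reaching the UDF are a whole PDF at all, since the catch block above only prints parse errors and an empty result looks the same either way. Below is a minimal sketch of such a check using only the JDK. The class `PdfBytesCheck` and its methods are hypothetical helpers, not part of Pig or Tika, and the check assumes the "%PDF-" header sits at offset 0, which is stricter than some readers require.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; not part of Pig or Tika.
final class PdfBytesCheck {
    // A PDF file begins with the ASCII marker "%PDF-" followed by a version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfBytesCheck() {}

    /** Returns true if the data starts with the "%PDF-" header at offset 0. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Builds a one-line summary of the payload that can be written to a log. */
    static String describe(byte[] data) {
        int length = (data == null) ? 0 : data.length;
        return "length=" + length + ", pdfHeader=" + looksLikePdf(data);
    }
}
```

Inside exec, the result of PdfBytesCheck.describe(dba.get()) could be logged before parsing. If the length is tiny or the header is missing, the input was probably truncated or altered before it reached the UDF, and the parsing code is not the first thing to fix.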