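We are trying to parse our PDF files before copying them to an HDFS location. Ideally, we would like a standard way to parse PDF files.
So far we have tried the following two packages, and their results are inconsistent.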
from tika import parser
parserPDF = parser.from_file("sample.pdf")
The above is the tika sample code.
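For reference, here is a minimal sketch of how the tika result could be reduced to plain text and normalized, so that output from different packages can be compared on equal terms. It assumes the tika-python package and that the dictionary returned by `parser.from_file` carries the extracted text under the `"content"` key. The helper names `normalize_text` and `pdf_to_text` are hypothetical and only for illustration.

```python
import re
import unicodedata


def normalize_text(raw):
    """Normalize extracted PDF text so outputs of different parsers are comparable.

    Hypothetical helper: the normalization rules here are an assumption,
    not a standard.
    """
    if not raw:
        # Covers None and empty strings (e.g. a PDF without a text layer).
        return ""
    # NFKC folds ligatures and non-breaking spaces into their plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # Unify line endings before splitting into lines.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of horizontal whitespace and trim each line.
    lines = [re.sub(r"[ \t\f\v]+", " ", line).strip() for line in text.split("\n")]
    # Drop blank lines so layout differences do not affect comparison.
    return "\n".join(line for line in lines if line)


def pdf_to_text(path):
    """Extract plain text from a PDF with tika, then normalize it."""
    # Imported lazily so normalize_text can be used without tika installed.
    from tika import parser

    parsed = parser.from_file(path)
    # Assumption: the extracted text is under "content" and may be None.
    return normalize_text(parsed.get("content"))
```

Calling `pdf_to_text("sample.pdf")` would then give a single string per file. The same `normalize_text` step could be applied to the output of any other package before comparing results.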
What is the standard way to parse a PDF into plain text?