Tika or JAXP, or both?

Asked: 2014-08-05 11:53:56

Tags: xpath apache-tika jaxp javax.xml

Please refer to the background thread to better understand my predicament ;)

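As mentioned there, I decided to use Tika to build a generic interface for parsing documents and extracting their content. For now, I am converting each document to XML/HTML with an appropriate ContentHandler.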

Here is a sample of the output:

    File type is application/vnd.openxmlformats-officedocument.wordprocessingml.document
    Handler <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="cp:revision" content="2" />
    <meta name="meta:last-author" content="ogilvie.f" />
    <meta name="Last-Author" content="ogilvie.f" />
    <meta name="meta:save-date" content="2012-04-24T15:24:00Z" />
    <meta name="Application-Name" content="Microsoft Office Word" />
    <meta name="Author" content="ogilvie.f" />
    <meta name="dcterms:created" content="2012-04-24T15:24:00Z" />
    <meta name="Application-Version" content="12.0000" />
    <meta name="Character-Count-With-Spaces" content="21667" />
    <meta name="date" content="2012-04-24T15:24:00Z" />
    <meta name="extended-properties:Template" content="Normal" />
    <meta name="meta:line-count" content="153" />
    <meta name="creator" content="ogilvie.f" />
    <meta name="publisher" content="Procter &amp; Gamble" />
    <meta name="Word-Count" content="3240" />
    <meta name="meta:paragraph-count" content="43" />
    <meta name="Creation-Date" content="2012-04-24T15:24:00Z" />
    <meta name="extended-properties:AppVersion" content="12.0000" />
    <meta name="meta:author" content="ogilvie.f" />
    <meta name="Line-Count" content="153" />
    <meta name="extended-properties:Application" content="Microsoft Office Word" />
    <meta name="Paragraph-Count" content="43" />
    <meta name="Last-Save-Date" content="2012-04-24T15:24:00Z" />
    <meta name="Last-Printed" content="2012-03-29T15:06:00Z" />
    <meta name="Revision-Number" content="2" />
    <meta name="meta:print-date" content="2012-03-29T15:06:00Z" />
    <meta name="meta:creation-date" content="2012-04-24T15:24:00Z" />
    <meta name="dcterms:modified" content="2012-04-24T15:24:00Z" />
    <meta name="Template" content="Normal" />
    <meta name="Page-Count" content="15" />
    <meta name="meta:character-count" content="18470" />
    <meta name="dc:creator" content="ogilvie.f" />
    <meta name="meta:word-count" content="3240" />
    <meta name="extended-properties:Company" content="Procter &amp; Gamble" />
    <meta name="Last-Modified" content="2012-04-24T15:24:00Z" />
    <meta name="custom:ContentTypeId" content="0x010100832DCE57D1DD144A851051A25C75E147" />
    <meta name="modified" content="2012-04-24T15:24:00Z" />
    <meta name="xmpTPg:NPages" content="15" />
    <meta name="dc:publisher" content="Procter &amp; Gamble" />
    <meta name="Character Count" content="18470" />
    <meta name="meta:page-count" content="15" />
    <meta name="meta:character-count-with-spaces" content="21667" />
    <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.wordprocessingml.document" />
    <title></title>
    </head>
    <body><p class="body_Text"><b>CONFIDENTIAL</b></p>
    <table><tbody><tr>  <td><p>principle</p>
</td>   <td><p>optimum</p>
</td>   <td><p>rationale</p>
</td></tr>
<tr>    <td><p>Number of  suppliers</p>
</td>   <td><p class="list_Paragraph">2-3 per plant</p>
<p class="list_Paragraph">&gt;80% with 5 per region/country cluster</p>
</td>   <td><p class="list_Paragraph">Competition is local</p>
<p class="list_Paragraph">Scale the spend with central accounts</p>
</td></tr>
<tr>    <td><p>Global/local suppliers</p>
</td>   <td><p>Regional is sufficient</p>
</td>   <td><p class="list_Paragraph">No advantage to global as scale is regional only and there is limited IP to transfer.</p>
<p class="list_Paragraph">Larger regional suppliers can consolidate local single-plant suppliers to make it efficient for us. They also bring capital for machinery upgrading and scale for paper source.</p>
</td></tr>
<tr>    <td><p>Approach to suppliers</p>
</td>   <td><p>collaborative</p>
</td>   <td><p>Competition to drive price is clear; preferential and value-add deals require collaboration</p>
</td></tr>
<tr>    <td><p>Make v buy</p>
</td>   <td><p>buy</p>
</td>   <td><p>Multiple suppliers; commoditised technologies</p>
</td></tr>
<tr>    <td><p>Distance of suppliers to plant</p>
</td>   <td><p class="list_Paragraph">Max 300km for boxes (300miles in NA); up to 1000km for paper reels.</p>
<p class="list_Paragraph">Can be longer for specialist print grades or to countries with no high quality local supply</p>
</td>   <td><p class="list_Paragraph">Economic max as high volume product (air in the fluting)</p>
<p class="list_Paragraph">Need recent built paper machines to produce paper strong enough to run on high-speed corrugators</p>
</td></tr>
<tr>    <td><p>Type of suppliers</p>
</td>   <td><p class="list_Paragraph">Integrated with containerboard making</p>
<p />
<p class="list_Paragraph">Corrugators on-site</p>
</td>   <td><p class="list_Paragraph">To assure supply and avoid being leveraged by paper making scale</p>
<p class="list_Paragraph">Cost structure not competitive if have to buy in board (shipping air)</p>
</td></tr>
<tr>    <td><p>Purchase of feedstocks</p>
</td>   <td><p>Not if integrated suppliers</p>
</td>   <td><p>Integrated suppliers have 20x our scale</p>
</td></tr>
<tr>    <td><p>Length and nature of contracts</p>
</td>   <td><p>Multiple year (2-3), but with fixed glidepath pricing/value every year</p>
</td>   <td><p>Significant effort for Purchases to re-enquire annually. High number of specs and low resources mean long time to qualify relative to additional value if only 12 month allocation.</p>
</td></tr>
<tr>    <td><p>Specifications</p>
</td>   <td><p class="list_Paragraph">Standard board weights</p>
<p />
<p />
<p class="list_Paragraph">Tailored box sizes</p>
</td>   <td><p class="list_Paragraph">Paper scale much higher so uneconomic to make tailored weight</p>
<p class="list_Paragraph">Maximising pallet fit delivers better savings and stronger pallet (less transport damages) than scale savings of standard box size.</p>
</td></tr>
<tr>    <td><p>Terms</p>
</td>   <td><p>Standard, including payment terms</p>
</td>   <td><p>High degree of competition, no specialist investment. Paper making has good cash-flow, so no need for shorter payment terms.</p>
</td></tr>
</tbody></table>
    <p>date</p>
    </td></tr>
    </tbody></table>
    <p />
    <p />
    <p>1</p>
    <p class="footer" />
    </body></html>

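The challenge starts when I want to extract elements from the handler's output. I was advised to use XPath, and to pull out the tables with a regular expression. I understand the concept, but I have not been able to get it working with Tika as explained here.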

After reading threads like this, I wonder whether I should drop Tika entirely in favour of JAXP, or use a combination of the two (?).
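To make the "combination" idea concrete, here is a rough sketch of what I have in mind: Tika produces the XHTML string, and plain JAXP (DOM plus XPath) then pulls the table rows out of it. Only the JAXP half is shown, using JDK classes alone. The class name `TableExtractor` and the method `extractRows` are names I made up for illustration, and the sketch assumes the XHTML is well-formed, so debug prefixes such as "File type is ..." and "Handler" would have to be stripped first. Because the output declares the XHTML namespace, the XPath matches on `local-name()` instead of registering a NamespaceContext.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Tika or JAXP, just an illustration.
class TableExtractor {

    /** Returns every table row as a list of whitespace-normalised cell texts. */
    static List<List<String>> extractRows(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Keep namespace information; the XPath below matches on local names.
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // All <tr> elements anywhere in the document, regardless of namespace.
            NodeList rows = (NodeList) xpath.evaluate(
                    "//*[local-name()='tr']", doc, XPathConstants.NODESET);

            List<List<String>> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                // Direct <td> children of the current row.
                NodeList cells = (NodeList) xpath.evaluate(
                        "*[local-name()='td']", rows.item(i), XPathConstants.NODESET);
                List<String> row = new ArrayList<>();
                for (int j = 0; j < cells.getLength(); j++) {
                    String text = cells.item(j).getTextContent();
                    row.add(text.trim().replaceAll("\\s+", " "));
                }
                result.add(row);
            }
            return result;
        } catch (Exception e) {
            // Malformed input (e.g. stray closing tags) ends up here.
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Small stand-in for the Tika output shown above.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<table><tbody>"
                + "<tr><td><p>principle</p>\n</td><td><p>optimum</p>\n</td></tr>"
                + "<tr><td><p>Make v buy</p>\n</td><td><p>buy</p>\n</td></tr>"
                + "</tbody></table></body></html>";
        for (List<String> row : extractRows(xhtml)) {
            System.out.println(row);
        }
    }
}
```

If this direction is sound, no regular expression would be needed for the tables at all, since XPath already selects the rows and cells directly.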

Can anyone point out where my assumptions or direction are wrong, and how I should proceed?

0 answers:

No answers