Question

My goal is to load MS Word, PDF, and similar documents onto HDFS, extract certain 'content' from each document, and use it for further analysis.

Rather than start fiddling with InputFormat and the like, I believe I can use a library such as Tika and incorporate it into MapReduce (MR).

Part of the content of one of the Word documents is as follows:

6.  Statement of Strategy 
We have 4 strategic interventions that will deliver a competitive advantage.
 Innovate upstream and downstream
1.  Biopulp.
We will execute Biopulp initially in corrugate for Haircare in China. This will validate the operational process of enzymatically converting straw into pulp and paper. Then we will establish a Joint Development with Family care to extend the sources of value. And finally re-apply globally including across for other sectors and customers to maximize the value generation.
2.  Mandrel Case Forming
We will extend the use of MCF technology within WE for the businesses that already use MCF cases. (i.e. F&HC). In parallel we will establish this as the global standard for HDL’s and HDW. We will seek additional suppliers to execute this technology in other regions (e.g. NA and Asia) to increase capacity and reduce cost of execution.

Supplier Strategy for Competition
3.  Competition in practice 
We have used negotiation as the primary process for establishing prices and supply agreements. We will more effectively create and utilize competition by using enquiries for each of our plants. This may require that we trigger new investments and qualify additional facilities, but with the consolidation going on in the industry it should not cause a net increase in suppliers.
4.  Cost input pass-through
Our current agreements in general use paper as the primary driver of our feedstock clauses. If paper prices go up then our suppliers are happy and we are not. If paper prices go down then we are happy and our suppliers’ are not. This means that almost 100% of the time one party is not happy. If we change our pass-through clauses to be driven by our suppliers’ input costs, then we align ourselves with their interests which will generate less transaction cost and increase collaboration

Optimum Sourcing Principles for Corrugates

<A TABLE HERE>

7.  Tactical Planning and Execution

<A TABLE HERE>

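Suppose I want to extract content based on the following headings:

"Optimum Sourcing Principles for Corrugates"
"Innovate upstream and downstream"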

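This may seem crazy and absurd, but I would like to know whether Tika (I tried it, but got only the metadata and the file as a string), Lucene/Solr, POI, etc. can help to parse and 'understand' Word and PDF documents, so that chunks of data can be extracted based on some search string (or regex).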
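To make the "extract a chunk based on a search string or regex" idea concrete, here is a minimal sketch that works on the plain text a parser returns. The class name `SectionExtractor` and the method `extractSection` are hypothetical, and the sketch assumes each heading sits on its own line and that the caller knows the heading that ends the section. It uses only `java.util.regex` from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika or POI.
class SectionExtractor {

    /**
     * Returns the text between the line containing startHeading and the next
     * line containing endHeading (or the end of the text if endHeading never
     * appears). Returns null when startHeading is not found.
     */
    static String extractSection(String text, String startHeading, String endHeading) {
        Pattern pattern = Pattern.compile(
                // the whole line that contains the start heading
                "^[^\\n]*" + Pattern.quote(startHeading) + "[^\\n]*(?:\\n|\\z)"
                // the section body, matched lazily across lines
                + "(.*?)"
                // stop before the line with the end heading, or at end of input
                + "(?=^[^\\n]*" + Pattern.quote(endHeading) + "|\\z)",
                Pattern.MULTILINE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String text = "6.  Statement of Strategy\n"
                + " Innovate upstream and downstream\n"
                + "Biopulp.\n"
                + "Mandrel Case Forming\n"
                + "\n"
                + "Supplier Strategy for Competition\n"
                + "Competition in practice\n";
        System.out.println(extractSection(text,
                "Innovate upstream and downstream",
                "Supplier Strategy for Competition"));
    }
}
```

The headings are matched as plain substrings, so a body sentence that happens to repeat a heading phrase would also match. Anchoring on stronger cues (numbering, style information from the parser) is likely needed for real documents.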

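For example, I used the Tika parser and got the following output, which is too naive to help in extracting content (there is no indication that 'a table goes here', i.e. a table in the Word document is rendered as plain paragraphs!):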

6.  Statement of Strategy 
We have 4 strategic interventions that will deliver a competitive advantage to P&G.
 Innovate upstream and downstream
Biopulp.
We will execute Biopulp initially in corrugate for Haircare in China. This will validate the operational process of enzymatically converting straw into pulp and paper. Then we will establish a Joint Development with Family care to extend the sources of value. And finally re-apply globally including across for other sectors and customers to maximize the value generation.
Mandrel Case Forming
We will extend the use of MCF technology within WE for the businesses that already use MCF cases. (i.e. F&HC). In parallel we will establish this as the global standard for HDL’s and HDW. We will seek additional suppliers to execute this technology in other regions (e.g. NA and Asia) to increase capacity and reduce cost of execution.

Supplier Strategy for Competition
Competition in practice 
We have used negotiation as the primary process for establishing prices and supply agreements. We will more effectively create and utilize competition by using enquiries for each of our plants. This may require that we trigger new investments and qualify additional facilities, but with the consolidation going on in the industry it should not cause a net increase in suppliers.
Cost input pass-through
Our current agreements in general use paper as the primary driver of our feedstock clauses. If paper prices go up then our suppliers are happy and we are not. If paper prices go down then we are happy and our suppliers’ are not. This means that almost 100% of the time one party is not happy. If we change our pass-through clauses to be driven by our suppliers’ input costs, then we align ourselves with their interests which will generate less transaction cost and increase collaboration.





Optimum Sourcing Principles for Corrugates
    principle
    optimum
    rationale

    Number of  suppliers
    2-3 per plant
>80% with 5 per region/country cluster
    Competition is local
Scale the spend with central accounts

    Global/local suppliers
    Regional is sufficient
    No advantage to global as scale is regional only and there is limited IP to transfer.
Larger regional suppliers can consolidate local single-plant suppliers to make it efficient for us. They also bring capital for machinery upgrading and scale for paper source.

    Approach to suppliers
    collaborative
    Competition to drive price is clear; preferential and value-add deals require collaboration

    Make v buy
    buy
    Multiple suppliers; commoditised technologies

    Distance of suppliers to plant
    Max 300km for boxes (300miles in NA); up to 1000km for paper reels.
Can be longer for specialist print grades or to countries with no high quality local supply
    Economic max as high volume product (air in the fluting)
Need recent built paper machines to produce paper strong enough to run on high-speed corrugators

    Type of suppliers
    Integrated with containerboard making

Corrugators on-site
    To assure supply and avoid being leveraged by paper making scale
Cost structure not competitive if have to buy in board (shipping air)

    Purchase of feedstocks
    Not if integrated suppliers
    Integrated suppliers have 20x our scale

    Length and nature of contracts
    Multiple year (2-3), but with fixed glidepath pricing/value every year
    Significant effort for Purchases to re-enquire annually. High number of specs and low resources mean long time to qualify relative to additional value if only 12 month allocation.

    Specifications
    Standard board weights


Tailored box sizes
    Paper scale much higher so uneconomic to make tailored weight
Maximising pallet fit delivers better savings and stronger pallet (less transport damages) than scale savings of standard box size.

    Terms
    Standard, including payment terms
    High degree of competition, no specialist investment. Paper making has good cash-flow, so no need for shorter payment terms.
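One possible way to keep the table structure is to have Tika emit XHTML instead of plain text, for example by passing an `org.apache.tika.sax.ToXMLContentHandler` to the parser in place of `BodyContentHandler`, and then reading the `<table>`, `<tr>` and `<td>` elements. The sketch below covers only the second half and uses the JDK's DOM parser. The class `XhtmlTableReader` and the sample markup are hypothetical, and the exact XHTML that Tika produces depends on the Tika version and the document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper that reads table rows out of XHTML text.
class XhtmlTableReader {

    /** Returns every table row in the XHTML as a list of trimmed cell texts. */
    static List<List<String>> readRows(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            List<List<String>> rows = new ArrayList<>();
            NodeList trs = doc.getElementsByTagName("tr");
            for (int i = 0; i < trs.getLength(); i++) {
                // Note: this also picks up cells of nested tables, if any
                NodeList tds = ((Element) trs.item(i)).getElementsByTagName("td");
                List<String> cells = new ArrayList<>();
                for (int j = 0; j < tds.getLength(); j++) {
                    cells.add(tds.item(j).getTextContent().trim());
                }
                rows.add(cells);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample that imitates XHTML output for a Word table
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<p>Optimum Sourcing Principles for Corrugates</p>"
                + "<table><tbody>"
                + "<tr><td><p>principle</p></td><td><p>optimum</p></td>"
                + "<td><p>rationale</p></td></tr>"
                + "<tr><td><p>Make v buy</p></td><td><p>buy</p></td>"
                + "<td><p>Multiple suppliers; commoditised technologies</p></td></tr>"
                + "</tbody></table></body></html>";
        for (List<String> row : readRows(xhtml)) {
            System.out.println(row);
        }
    }
}
```

Because the rows come back as lists of cells, the three columns (principle, optimum, rationale) stay aligned, which the flattened text output above does not allow.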

Below is the sample Tika code I wrote (I cannot figure out how to do the above when documents of different types (PDF, MS Word, etc.) arrive):

private void parseFileForContent(String absolutePath) throws IOException,
            SAXException, TikaException {
        // TODO Auto-generated method stub

        System.out.println("absolutePath : " + absolutePath);

        Tika tika = new Tika();

        File path = new File(absolutePath);

        if (path.isDirectory()) {

            File[] files = path.listFiles();

            for (File file : files) {

                System.out.println("File type is " + tika.detect(file));
            }
        } else {
            System.out.println("File type is " + tika.detect(path));

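            Parser parser = new AutoDetectParser();

            // -1 disables BodyContentHandler's default 100,000-character write limit
            ContentHandler handler = new BodyContentHandler(-1);
            Metadata metadata = new Metadata();

            // try-with-resources closes the stream even if parsing fails
            try (TikaInputStream stream = TikaInputStream.get(path)) {
                parser.parse(stream, handler, metadata, new ParseContext());
            }

            //displayMetadata(metadata);

            System.out.println("Handler "+handler.toString());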
        }

    }

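I would prefer to use Tika because Apache POI is limited to MS documents, but with POI I can do some reasonable things, such as extracting paragraphs, tables, etc.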

package com.lnt.sap.sp2.scratchpad;

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

import org.apache.poi.xwpf.usermodel.IBodyElement;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

public class POIScratchpad {

    public static void main(String[] args) {
        // TODO Auto-generated method stub

        String absolutePath = args[0];

        POIScratchpad poiScratchpad = new POIScratchpad();

        poiScratchpad.parseMSDocuments(absolutePath);
    }

    private void parseMSDocuments(String absolutePath) {
        // TODO Auto-generated method stub

        // try-with-resources closes the file stream once the document is processed
        try (FileInputStream in = new FileInputStream(absolutePath)) {

            XWPFDocument doc = new XWPFDocument(in);

            displayElements(doc);
            // displayParagraphs(doc);
            // displayTables(doc);

        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void displayElements(XWPFDocument doc) {
        // TODO Auto-generated method stub

        java.util.Iterator<IBodyElement> bodyElementIterator = doc
                .getBodyElementsIterator();

        int cnt = 0;

        while (bodyElementIterator.hasNext()) {
            IBodyElement element = bodyElementIterator.next();

            System.out.println("**********" + cnt + "**********");

            System.out.println("Element type is " + element.getElementType());
            System.out.println("Part is : " + element.getPart());
            System.out.println("Part Type is : " + element.getPartType());
            System.out.println("Body is : " + element.getBody());
            System.out.println("element is " + element);

            System.out.println("**********");

            cnt++;
        }
    }

    private void displayParagraphs(XWPFDocument doc) {
        // TODO Auto-generated method stub
        List<XWPFParagraph> paragraphs = doc.getParagraphs();

        int cnt = 0;

        for (XWPFParagraph paragraph : paragraphs) {

            System.out.println("**********" + cnt + "**********");
            System.out.println(paragraph.getParagraphText());
            System.out.println("********************");

            cnt++;
        }
    }

    private void displayTables(XWPFDocument doc) {
        // TODO Auto-generated method stub

        Iterator<XWPFTable> tableIterator = doc.getTablesIterator();

        int cnt = 0;

        while (tableIterator.hasNext()) {

            XWPFTable table = tableIterator.next();

            System.out.println("**********" + cnt + "**********");

            List<XWPFTableRow> rows = table.getRows();

            for (XWPFTableRow row : rows) {

                List<XWPFTableCell> cells = row.getTableCells();

                for (XWPFTableCell cell : cells) {
                    System.out.println(cell.getText());
                }
            }

            System.out.println("********************");

            cnt++;
        }
    }
}
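Since `getBodyElementsIterator()` returns paragraphs and tables in document order, one possible approach is to walk the elements, note when a paragraph contains the heading of interest, and take the next table that follows it. The sketch below shows only that matching logic on a hypothetical `Block` stand-in, so it runs without POI. In real code each `Block` would be built from an `IBodyElement`, using `getElementType()` to tell paragraphs from tables.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of "find the table that follows a heading".
class HeadingTableMatcher {

    /** Hypothetical stand-in for one body element: a paragraph or a table. */
    static final class Block {
        final boolean isTable;
        final String text; // paragraph text, or a label for the table in this demo

        Block(boolean isTable, String text) {
            this.isTable = isTable;
            this.text = text;
        }
    }

    /**
     * Returns the first table that appears after a paragraph containing the
     * heading, or null if there is no such table.
     */
    static Block tableAfterHeading(List<Block> blocks, String heading) {
        boolean headingSeen = false;
        for (Block block : blocks) {
            if (!block.isTable && block.text.contains(heading)) {
                headingSeen = true;
            } else if (block.isTable && headingSeen) {
                return block;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block(false, "Optimum Sourcing Principles for Corrugates"),
                new Block(true, "sourcing principles table"),
                new Block(false, "7.  Tactical Planning and Execution"),
                new Block(true, "tactical planning table"));
        Block table = tableAfterHeading(blocks, "Tactical Planning and Execution");
        System.out.println(table == null ? "no table found" : table.text);
    }
}
```

The same ordered-blocks idea could be applied to other formats if the parser exposes structure, but PDF files generally do not store tables as explicit elements, so that case would need a different strategy.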

What should my approach be? Are my assumptions unrealistic, or do I need more information within the files?

Word，PDF文档解析 - Hadoop / in-general Java

0 answers: