Word, PDF document parsing - Hadoop / Java in general

Time: 2014-07-31 09:05:41

Tags: java hadoop apache-poi text-parsing apache-tika

My goal is to load MS-Word, PDF and similar documents onto HDFS, extract certain "content" from each document, and use it for further analysis.

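I assume it is possible to use a library like Tika and incorporate it into MR (MapReduce), instead of starting to fiddle with InputFormat and the like.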

Part of one of the Word documents is shown below:

6.  Statement of Strategy 
We have 4 strategic interventions that will deliver a competitive advantage.
 Innovate upstream and downstream
1.  Biopulp.
We will execute Biopulp initially in corrugate for Haircare in China. This will validate the operational process of enzymatically converting straw into pulp and paper. Then we will establish a Joint Development with Family care to extend the sources of value. And finally re-apply globally including across for other sectors and customers to maximize the value generation.
2.  Mandrel Case Forming
We will extend the use of MCF technology within WE for the businesses that already use MCF cases. (i.e. F&HC). In parallel we will establish this as the global standard for HDL’s and HDW. We will seek additional suppliers to execute this technology in other regions (e.g. NA and Asia) to increase capacity and reduce cost of execution.

Supplier Strategy for Competition
3.  Competition in practice 
We have used negotiation as the primary process for establishing prices and supply agreements. We will more effectively create and utilize competition by using enquiries for each of our plants. This may require that we trigger new investments and qualify additional facilities, but with the consolidation going on in the industry it should not cause a net increase in suppliers.
4.  Cost input pass-through
Our current agreements in general use paper as the primary driver of our feedstock clauses. If paper prices go up then our suppliers are happy and we are not. If paper prices go down then we are happy and our suppliers’ are not. This means that almost 100% of the time one party is not happy. If we change our pass-through clauses to be driven by our suppliers’ input costs, then we align ourselves with their interests which will generate less transaction cost and increase collaboration

Optimum Sourcing Principles for Corrugates

<A TABLE HERE>

7.  Tactical Planning and Execution

<A TABLE HERE>

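Suppose I want to do the following:

  1. Extract the table under "Optimum Sourcing Principles for Corrugates"
  2. Extract the bullet points under "Innovate upstream and downstream"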

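While this may seem insane and absurd, I would like to know whether Tika (I tried it, but it only gives the metadata and the file contents as a String), Lucene/Solr, POI, etc. can help to parse and "understand" Word and PDF documents, so that chunks of data can be extracted based on search strings (or regular expressions).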

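For example, I used the Tika parser and got the following output, which is too naive to help with extracting content (the "A TABLE HERE" part, i.e. a table in the Word document, is interpreted as paragraphs!):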

    6.  Statement of Strategy 
    We have 4 strategic interventions that will deliver a competitive advantage to P&G.
     Innovate upstream and downstream
    Biopulp.
    We will execute Biopulp initially in corrugate for Haircare in China. This will validate the operational process of enzymatically converting straw into pulp and paper. Then we will establish a Joint Development with Family care to extend the sources of value. And finally re-apply globally including across for other sectors and customers to maximize the value generation.
    Mandrel Case Forming
    We will extend the use of MCF technology within WE for the businesses that already use MCF cases. (i.e. F&HC). In parallel we will establish this as the global standard for HDL’s and HDW. We will seek additional suppliers to execute this technology in other regions (e.g. NA and Asia) to increase capacity and reduce cost of execution.
    
    Supplier Strategy for Competition
    Competition in practice 
    We have used negotiation as the primary process for establishing prices and supply agreements. We will more effectively create and utilize competition by using enquiries for each of our plants. This may require that we trigger new investments and qualify additional facilities, but with the consolidation going on in the industry it should not cause a net increase in suppliers.
    Cost input pass-through
    Our current agreements in general use paper as the primary driver of our feedstock clauses. If paper prices go up then our suppliers are happy and we are not. If paper prices go down then we are happy and our suppliers’ are not. This means that almost 100% of the time one party is not happy. If we change our pass-through clauses to be driven by our suppliers’ input costs, then we align ourselves with their interests which will generate less transaction cost and increase collaboration.
    
    
    
    
    
    Optimum Sourcing Principles for Corrugates
        principle
        optimum
        rationale
    
        Number of  suppliers
        2-3 per plant
    >80% with 5 per region/country cluster
        Competition is local
    Scale the spend with central accounts
    
        Global/local suppliers
        Regional is sufficient
        No advantage to global as scale is regional only and there is limited IP to transfer.
    Larger regional suppliers can consolidate local single-plant suppliers to make it efficient for us. They also bring capital for machinery upgrading and scale for paper source.
    
        Approach to suppliers
        collaborative
        Competition to drive price is clear; preferential and value-add deals require collaboration
    
        Make v buy
        buy
        Multiple suppliers; commoditised technologies
    
        Distance of suppliers to plant
        Max 300km for boxes (300miles in NA); up to 1000km for paper reels.
    Can be longer for specialist print grades or to countries with no high quality local supply
        Economic max as high volume product (air in the fluting)
    Need recent built paper machines to produce paper strong enough to run on high-speed corrugators
    
        Type of suppliers
        Integrated with containerboard making
    
    Corrugators on-site
        To assure supply and avoid being leveraged by paper making scale
    Cost structure not competitive if have to buy in board (shipping air)
    
        Purchase of feedstocks
        Not if integrated suppliers
        Integrated suppliers have 20x our scale
    
        Length and nature of contracts
        Multiple year (2-3), but with fixed glidepath pricing/value every year
        Significant effort for Purchases to re-enquire annually. High number of specs and low resources mean long time to qualify relative to additional value if only 12 month allocation.
    
        Specifications
        Standard board weights
    
    
    Tailored box sizes
        Paper scale much higher so uneconomic to make tailored weight
    Maximising pallet fit delivers better savings and stronger pallet (less transport damages) than scale savings of standard box size.
    
        Terms
        Standard, including payment terms
        High degree of competition, no specialist investment. Paper making has good cash-flow, so no need for shorter payment terms.
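
The flattening above is what a plain-text content handler produces. Tika parsers emit XHTML SAX events, so a handler that keeps the markup (I believe `org.apache.tika.sax.ToXMLContentHandler` does this, but treat that as an assumption to check against your Tika version) should preserve `<table>`, `<tr>` and `<td>` elements. The sketch below uses only the JDK and assumes such an XHTML string is already available; the class name `TableAfterHeading` and the sample markup are hypothetical, hand-written for illustration, and not real Tika output. It returns the rows of the first table that follows a paragraph or heading whose text equals the search string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TableAfterHeading {

    /** Rows of the first table that appears after a p/h1-h6 element whose text equals heading. */
    static List<List<String>> extract(String xhtml, String heading) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        // getElementsByTagName("*") lists every element in document order
        NodeList all = doc.getElementsByTagName("*");
        boolean headingSeen = false;
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            String tag = el.getTagName();
            if (!headingSeen) {
                headingSeen = (tag.equals("p") || tag.matches("h[1-6]"))
                        && el.getTextContent().trim().equals(heading);
            } else if (tag.equals("table")) {
                return rowsOf(el);
            }
        }
        return new ArrayList<>(); // heading or table not found
    }

    /** One list of cell texts per row; nested tables are not handled in this sketch. */
    static List<List<String>> rowsOf(Element table) {
        List<List<String>> rows = new ArrayList<>();
        NodeList trs = table.getElementsByTagName("tr");
        for (int r = 0; r < trs.getLength(); r++) {
            List<String> cells = new ArrayList<>();
            NodeList children = trs.item(r).getChildNodes();
            for (int c = 0; c < children.getLength(); c++) {
                Node n = children.item(c);
                if (n.getNodeName().equals("td") || n.getNodeName().equals("th")) {
                    cells.add(n.getTextContent().trim());
                }
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical, hand-written sample in the shape structured output might take
        String xhtml = "<html><body>"
                + "<p>Optimum Sourcing Principles for Corrugates</p>"
                + "<table><tbody>"
                + "<tr><td><p>principle</p></td><td><p>optimum</p></td><td><p>rationale</p></td></tr>"
                + "<tr><td><p>Make v buy</p></td><td><p>buy</p></td>"
                + "<td><p>Multiple suppliers; commoditised technologies</p></td></tr>"
                + "</tbody></table></body></html>";
        for (List<String> row : extract(xhtml, "Optimum Sourcing Principles for Corrugates")) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Matching on the exact heading text is deliberately strict; a regular expression or a case-insensitive comparison could be swapped in if headings vary between documents.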
    

Below is the sample Tika code I wrote (I cannot figure out how to do the above when documents of different types (PDF, MS-Word, etc.) arrive):

    private void parseFileForContent(String absolutePath) throws IOException,
                SAXException, TikaException {
            // For a directory, print the detected type of each file;
            // for a single file, also parse it and print the body text.
    
            System.out.println("absolutePath : " + absolutePath);
    
            Tika tika = new Tika();
    
            File path = new File(absolutePath);
    
            if (path.isDirectory()) {
    
                File[] files = path.listFiles();
    
                for (File file : files) {
    
                    System.out.println("File type is " + tika.detect(file));
                }
            } else {
                System.out.println("File type is " + tika.detect(path));
    
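                Parser parser = new AutoDetectParser();
    
                // BodyContentHandler collects the body as plain text, so table structure is lost
                ContentHandler handler = new BodyContentHandler();
                Metadata metadata = new Metadata();
    
                // Close the stream once parsing has finished
                try (TikaInputStream stream = TikaInputStream.get(path)) {
                    parser.parse(stream, handler, metadata, new ParseContext());
                }
    
                //displayMetadata(metadata);
    
                System.out.println("Handler "+handler.toString());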
            }
    
        }
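
Once the handler's text is available as a String, a search-string approach to the second goal (the points under "Innovate upstream and downstream") can be sketched with the JDK alone. This is a minimal sketch, not a robust parser: the class name `SectionExtractor` is hypothetical, the sample text is abbreviated from the output above, and it assumes a section ends at the next blank line, which holds for that output but may not hold for other documents.

```java
import java.util.ArrayList;
import java.util.List;

class SectionExtractor {

    /**
     * Returns the non-blank lines that follow the first line ending with
     * heading, stopping at the next blank line or at the end of the text.
     * endsWith (rather than equals) tolerates a leading bullet glyph.
     */
    static List<String> linesUnder(String text, String heading) {
        List<String> result = new ArrayList<>();
        boolean inSection = false;
        for (String line : text.split("\\R")) { // \R matches any line break (Java 8+)
            String trimmed = line.trim();
            if (!inSection) {
                inSection = trimmed.endsWith(heading);
            } else if (trimmed.isEmpty()) {
                break; // a blank line closes the section
            } else {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Abbreviated sample in the shape of the plain-text output
        String text = String.join("\n",
                "6.  Statement of Strategy",
                " Innovate upstream and downstream",
                "Biopulp.",
                "We will execute Biopulp initially in corrugate for Haircare in China.",
                "Mandrel Case Forming",
                "We will extend the use of MCF technology within WE.",
                "",
                "Supplier Strategy for Competition",
                "Competition in practice");
        for (String line : linesUnder(text, "Innovate upstream and downstream")) {
            System.out.println(line);
        }
    }
}
```

Because the plain text no longer carries list numbering or styles, this cannot tell a bullet title ("Biopulp.") from its paragraph; that distinction needs structured output or a format-specific API.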
    

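I was hoping to use Tika because Apache POI is limited to MS documents, but with POI I can at least do something reasonable, such as extracting paragraphs, tables, etc.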

    package com.lnt.sap.sp2.scratchpad;
    
    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;
    
    import org.apache.poi.xwpf.usermodel.IBodyElement;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFTable;
    import org.apache.poi.xwpf.usermodel.XWPFTableCell;
    import org.apache.poi.xwpf.usermodel.XWPFTableRow;
    
    public class POIScratchpad {
    
        public static void main(String[] args) {
            // TODO Auto-generated method stub
    
            String absolutePath = args[0];
    
            POIScratchpad poiScratchpad = new POIScratchpad();
    
            poiScratchpad.parseMSDocuments(absolutePath);
        }
    
        private void parseMSDocuments(String absolutePath) {
            // TODO Auto-generated method stub
    
            // try-with-resources closes the input stream once the document has been read
            try (FileInputStream in = new FileInputStream(absolutePath)) {
    
                XWPFDocument doc = new XWPFDocument(in);
    
                displayElements(doc);
                // displayParagraphs(doc);
                // displayTables(doc);
    
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    
        private void displayElements(XWPFDocument doc) {
            // TODO Auto-generated method stub
    
            Iterator<IBodyElement> bodyElementIterator = doc
                    .getBodyElementsIterator();
    
            int cnt = 0;
    
            while (bodyElementIterator.hasNext()) {
                IBodyElement element = bodyElementIterator.next();
    
                System.out.println("**********" + cnt + "**********");
    
                System.out.println("Element type is " + element.getElementType());
                System.out.println("Part is : " + element.getPart());
                System.out.println("Part Type is : " + element.getPartType());
                System.out.println("Body is : " + element.getBody());
                System.out.println("element is " + element);
    
                System.out.println("**********");
    
                cnt++;
            }
        }
    
        private void displayParagraphs(XWPFDocument doc) {
            // TODO Auto-generated method stub
            List<XWPFParagraph> paragraphs = doc.getParagraphs();
    
            int cnt = 0;
    
            for (XWPFParagraph paragraph : paragraphs) {
    
                System.out.println("**********" + cnt + "**********");
                System.out.println(paragraph.getParagraphText());
                System.out.println("********************");
    
                cnt++;
            }
        }
    
        private void displayTables(XWPFDocument doc) {
            // TODO Auto-generated method stub
    
            Iterator<XWPFTable> tableIterator = doc.getTablesIterator();
    
            int cnt = 0;
    
            while (tableIterator.hasNext()) {
    
                XWPFTable table = tableIterator.next();
    
                System.out.println("**********" + cnt + "**********");
    
                List<XWPFTableRow> rows = table.getRows();
    
                for (XWPFTableRow row : rows) {
    
                    List<XWPFTableCell> cells = row.getTableCells();
    
                    for (XWPFTableCell cell : cells) {
                        System.out.println(cell.getText());
                    }
                }
    
                System.out.println("********************");
    
                cnt++;
            }
        }
    }
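
One possible way to combine the two libraries as different document types arrive is to dispatch on the MIME type string that `tika.detect(...)` returns: a structure-aware extractor (for example, POI-based) for DOCX, and a generic fallback for everything else. The sketch below shows only the dispatch; `ExtractorRegistry` and the stub extractors are hypothetical names introduced for illustration, and the lambdas stand in for real POI or Tika calls.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class ExtractorRegistry {

    // Standard MIME types for the two formats discussed here
    static final String DOCX =
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
    static final String PDF = "application/pdf";

    private final Map<String, Function<Path, String>> extractors = new HashMap<>();
    private final Function<Path, String> fallback;

    ExtractorRegistry(Function<Path, String> fallback) {
        this.fallback = fallback;
    }

    /** Registers an extractor for one MIME type; returns this for chaining. */
    ExtractorRegistry register(String mimeType, Function<Path, String> extractor) {
        extractors.put(mimeType, extractor);
        return this;
    }

    /** Runs the extractor registered for mimeType, or the fallback if there is none. */
    String extract(String mimeType, Path file) {
        return extractors.getOrDefault(mimeType, fallback).apply(file);
    }

    public static void main(String[] args) {
        // Stub extractors: real ones would call POI, Tika, or a PDF library
        ExtractorRegistry registry =
                new ExtractorRegistry(f -> "generic text extraction: " + f)
                        .register(DOCX, f -> "structure-aware DOCX extraction: " + f)
                        .register(PDF, f -> "PDF extraction: " + f);

        System.out.println(registry.extract(DOCX, Paths.get("strategy.docx")));
        System.out.println(registry.extract("text/plain", Paths.get("notes.txt")));
    }
}
```

Keeping the dispatch separate from the extractors means a mapper could detect the type once per file and then hand it to whichever extractor is registered, without type checks scattered through the job code.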
    

How should I proceed? Where are my assumptions unrealistic, and where is more information from the documents required?

0 answers:

No answers