Extract text from a PDF and compare the text within the PDF

Time: 2019-11-19 00:13:14

Tags: java arrays pdf data-extraction

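I need help! I need to extract the text from a PDF, search each page for drivers, compare them, and report an error if they do not match. So far I can extract the text, search for drivers, and compare them across the whole document. However, I need to do this for each page separately. It should not be hard, but I am missing something.

The values are compared against the values in a text file. I am sure this code can be improved, so if you have any ideas, please let me know.

Thank you for your help!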

package com.ds.stack;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Scanner;

public class MaxMinNumber {
    public static void main(String[] args) throws IOException {

        String path = "C:\\Users\\***\\Desktop\\javaPDF 001\\JavaPDF 000\\multuplePDFtest 001\\";
        String files;
        // File inputFile = new (pathToTextFile);
        File folder = new File(path);
        File[] ListOfFiles = folder.listFiles();
        ArrayList<String> listStrings = new ArrayList<>();
        int i;
        int b = 0;
        int r;
        // Extracting drivers from text file
        Scanner s = new Scanner(new File("C:\\Users\\****\\Desktop\\javaPDF 001\\JavaPDF 000\\DriversList.txt"));
        ArrayList<String> driversLibrary = new ArrayList<>();
        while (s.hasNextLine())
            driversLibrary.add(s.nextLine());
        s.close();

        for (File ListOfFile : ListOfFiles) {
            if (ListOfFile.isFile()) {
                files = ListOfFile.getName();
                if (files.startsWith("page") || files.endsWith(".PDF")) {
                    System.out.println(files);
                    // Reuse the folder path declared above instead of repeating it as a second literal.
                    PDFManager pdfManager = new PDFManager();
                    String pdfToText = pdfManager.pdftoText(path + files);
                    if (pdfToText == null) {
                        System.out.println("PDF to Text Conversion failed.");
                    } else {
                        // Keep only successfully extracted text for the comparison below.
                        listStrings.add(pdfToText);
                        System.out.println("\nText from PDF file\n\n" + pdfToText);
                    }
                }
            }

        }
        String[] array2 = driversLibrary.toArray(new String[0]);
        // System.out.println(Arrays.toString(array2));

        ArrayList<String> driversList = new ArrayList<>();
        Object[] array = listStrings.toArray();
        String convertToString = Arrays.toString(array);
        // System.out.println("-----\n"+convertToString+"\n----\n");
        String[] array1 = convertToString.split("\\b+");
        // System.out.println("---------------------");
        // System.out.println (Arrays.toString(array));
        for (i = 0; i < array1.length; i++) {
            for (r = 0; r < array2.length; r++) {
                // System.out.println ("-\n"+array1[i]+"\n-\n");
                if (array1[i].matches(array2[r])) {
                    // System.out.println("There is a driver");
                    driversList.add(array1[i]);
                } else if (array1[i].matches(".*\\b_xxx_\\b.*")) {
                    b = b + 1;

                }
            }

        }

        boolean driversMatchPositiveOutput = false;
        boolean driversMatchNegativeOutput = false;

        for (int j = 0; j < driversList.size(); j++) {
            for (int p = j + 1; p < driversList.size(); p++) {

                if (driversList.get(j).equals(driversList.get(p))) {
                    // System.out.println("Drivers match");
                    driversMatchPositiveOutput = true;
                } else {
                    // System.out.println("Drivers do not match");
                    driversMatchNegativeOutput = true;
                }

            }

        }
        System.out.println("Drivers on page - " + driversList);
        System.out.println("Drivers Library - " + Arrays.toString(array2));
        if (driversMatchPositiveOutput && !driversMatchNegativeOutput) {
            System.out.println("Drivers match");
        } else if ((b % 2 == 0) && driversMatchNegativeOutput) {
            System.out.println("Driver is missing");
        } else {
            System.out.println("Drivers do not match");
        }

    }

}
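
The code above joins the text of every file into one string with Arrays.toString before searching, which is why the comparison covers the whole document. A minimal sketch of a per-page check follows, assuming each file in the folder (page1.pdf, page2.pdf, ...) holds one page and that its text has already been extracted. The class name PerPageDriverCheck, the methods findDrivers and checkPage, and the choice to match whitespace-separated tokens by exact equality are all illustrative assumptions, not part of the original code. The sample strings are shortened excerpts of the output shown in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PerPageDriverCheck {

    // Returns every token on one page that exactly equals a known driver code.
    public static List<String> findDrivers(String pageText, Set<String> driversLibrary) {
        List<String> found = new ArrayList<>();
        // Split on whitespace so that a code such as "DR15892A" stays a single token.
        for (String token : pageText.split("\\s+")) {
            if (driversLibrary.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    // Classifies one page: no drivers, a single distinct driver, or mixed drivers.
    public static String checkPage(List<String> driversOnPage) {
        if (driversOnPage.isEmpty()) {
            return "No drivers found";
        }
        Set<String> distinct = new HashSet<>(driversOnPage);
        return distinct.size() == 1 ? "Drivers match" : "Drivers do not match";
    }

    public static void main(String[] args) {
        // Stand-in for the contents of DriversList.txt.
        Set<String> library = new HashSet<>(Arrays.asList("DR15892A", "DR15892", "DR15860"));

        // Stand-in for the text extracted from each file; one entry per page.
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("page1.pdf", "PO: 101243\nSO: 89844 (Rev: 000)");
        pages.put("page2.pdf", "2L2\nLG-C5-2 Cart.\n (890mA)\nDR15892A\n2L2\nLG-C5-2 Cart.\n (890mA)\nDR15892A");

        // Each page is searched and judged on its own, never merged with the others.
        for (Map.Entry<String, String> page : pages.entrySet()) {
            List<String> drivers = findDrivers(page.getValue(), library);
            System.out.println(page.getKey() + ": " + drivers + " -> " + checkPage(drivers));
        }
    }
}
```

Keeping one result per page means a mismatch can be reported together with the file name it came from. Using a Set for the library also avoids treating each library entry as a regular expression, which is what String.matches does in the original loop.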


My output is:

page1.pdf

Text from PDF file

BILL OF MATERIALS SALES ORDER: 89844 (Rev: 000)CONFIRMATION
IMPORTANT: PLEASE READ
Please note that this order will not be scheduled for production until approved shop drawings are received.
Approved Shop drawings must be signed and emailed to your CSR
Page(s) Line Type Qty. Code Rev. # Rev. Date By:
1 1 S4 1 WBSLED-750-80-35-S-16'-AP-UNV-DP-1-SC 0 11/4/2019 MR
Information

PO: 101243
SO: 89844 (Rev: 000)

page2.pdf

Text from PDF file

SHOP DRAWING NOT AVAILABLE. :(
WET BEAM SHOP DRAWINGSurface - Surface Solid Ceiling Type: S4
SECTION VIEW WIRING DIAGRAM JOINING SYSTEM INFORMATION/APPROVAL
 4”
4”
3 1/16” 
FLX-D
Ci
rcu
it 
1
Hot
Neutral
Ground
0-10V
GRY
PPL
GRN
WHT
BLK
Note: Must follow max. wires length for 
Dimming applications

Sales Rep: CSR name not set
PO: 101243 SO: 89844 (Rev: 000)

Fixture Page: 1/1 Document Page: 1/1
Approved By: Approved Date:
NOTES
ORDERING CODE
QTY. PRODUCTID
NOM.
LUMENS/FT CRI
COLOR
TEMP
SHIELDING
DIRECT LENGTH FINISH VOLTAGE DRIVER CIRCUITS MOUNTING
1
WBSLED 750 80 35 S 16' AP UNV DP 1 SC
WBSLED 750 lumens 80 CRI 3500 K Satin Lens CustomLENGTH
Aluminum
Paint Universal
Dimming
(0-10V) 1% 1 Circuit
Surface
Solid Ceiling
satin lens
SIDE VIEW
BOTTOM VIEW
S4-16-0#1  S4-16-0#2
200 3/4" overall length with end caps
967/8" extrusion cut length 96 7/8" extrusion cut length
MAX: 94" between mounting 
points
MAX: 94"  between mounting 
points
3/8" Gap
3 3/32" 3 3/32"
FOR WET LOCATION 
APPLICATION ONLY
JOINER
JOINER

Must be wet location
Water tight connector 

Electrical wet location 
box by Others
POWER FEED
EF END FEED
2L2
LG-C5-2 Cart.
 (890mA)
DR15892A
2L2
LG-C5-2 Cart.
 (890mA)
DR15892A
2L2
LG-C5-2 Cart.
 (890mA)
DR15892A
2L2
LG-C5-2 Cart.
 (890mA)
DR15892A
2L1
LG-C4-2 Cart.
_xxx_
_xxx_


Drivers on page - [DR15892A, DR15892A, DR15892A, DR15892A]
Drivers Library - [DR15892A, DR15892, DR15860]
Driver is missing
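
In the code above, the `_xxx_` test sits inside the inner loop over the driver library, so b grows by one for every library entry that a placeholder token fails to match, not once per placeholder. A minimal sketch of counting placeholders once each, per page, follows. It assumes that a line consisting only of `_xxx_` marks a slot where a driver is absent, which is inferred from the question's code and output and not confirmed by the asker. The class name PlaceholderCount and the method countPlaceholders are hypothetical.

```java
public class PlaceholderCount {

    // Assumed marker for a missing driver; taken from the regex in the question's code.
    static final String PLACEHOLDER = "_xxx_";

    // Counts the lines of one page that consist only of the placeholder.
    public static int countPlaceholders(String pageText) {
        int count = 0;
        // "\\R" matches any line break, so Windows and Unix line endings both work.
        for (String line : pageText.split("\\R")) {
            if (line.trim().equals(PLACEHOLDER)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Last four lines of the page2.pdf text shown in the question.
        String tail = "2L1\nLG-C4-2 Cart.\n_xxx_\n_xxx_";
        // prints "Placeholders on page: 2"
        System.out.println("Placeholders on page: " + countPlaceholders(tail));
    }
}
```

A per-page count like this could then be combined with the drivers found on the same page to decide whether to print "Driver is missing" for that page.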

0 answers:

No answers yet.