Question

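I am working on an extension for OpenBravoPOS that reads the invoices we receive from the company we order our products from.

The invoice is created as a PDF. I use the iText library to read the specific order lines. The problem is that I can only read the page I need as one big string. The string looks like this: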

LEVERINGSBON 30/06/2012 27828/2012/NL/WebShop   Distributeur ID nummer: 15099191 Uw distributeur: Klant Naam: FM Point Marcel Snoeck Adres: Zonnedauw 17 5953MS Reuver Telefoon: +31654317017 E-MAIL: yvonneenmarcel@home.nl Opmerking: -  Lp. Rekening Totaal FV/39525/2012/NL     vd Wal Sandra 72.00 1 3 x 354 - Luxury Collection 50ml NEW! 72.00 FV/39526/2012/NL     Slaats Tim 6.00 2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00 FV/39527/2012/NL     Nabben Britt 44.95 3 3 x E013 - Krachtreiniger 1000ml 24.75 4 2 x E016 -Tapijtreiniger 1000ml 9.20 5 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39528/2012/NL     Nabben Lieke 32.00 6 1 x 192 - Luxury Collection 50ml 21.00 7 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39529/2012/NL     Claessens Patrick 12.40 8 1 x P101 - Peeling VERBENA 12.40 FV/39530/2012/NL     Smits Yolanda 56.00 9 1 x E006 - Wasmiddel VIVID COLOURS 1000ml 7.00 10 2 x B023 - Body Lotion 200ml NEW 18.40 11 2 x 023 - Classic Collection 30ml 30.60 FV/39531/2012/NL     van Pol-Thijssen Silvia 34.70 12 1 x 110 - Classic Collection 50ml 15.30 13 1 x N003 - Nagellak HOT RED 7.00 14 1 x P103 - Peeling CHERRY BLOSSOM 12.40 Aantal: 21 Totaal: 258.05 € 1.17.4564.29482 1/1        "

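What I am trying to do is read each line and determine whether it is an order line; if it is, I need to store it in the database.

An order line looks like this: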

2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00

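You can read this as follows: order line number 2, quantity 1, product code KR01, description Eye Pencil DECADENCE BLACK, price 6.00.

Is there a simple way to read this long string and isolate the order lines from it?

Thanks for your replies.

My code so far: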

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package part4.chapter15;

import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/small.pdf";
    /** The resulting text file. */
    public static final String RESULT = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/sample-result.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {

        /** Putting result in Array, to be able extract to Table */
        PdfArray array;

        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            String str = strategy.getResultantText();
            CharSequence FindPage = "Lp. Rekening Totaal";
            if (str.contains(FindPage)) {
                out.println(strategy.getResultantText());
            }
        }
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }

}

Answer 1

You can design a regex to solve this in several ways. Here is one:

    String pdf = "LEVERINGSBON 30/06/2012 27828/2012/NL/WebShop   Distributeur ID nummer: 15099191 Uw distributeur: Klant Naam: FM Point Marcel Snoeck Adres: Zonnedauw 17 5953MS Reuver Telefoon: +31654317017 E-MAIL: yvonneenmarcel@home.nl Opmerking: - Lp. Rekening Totaal FV/39525/2012/NL     vd Wal Sandra 72.00 1 3 x 354 - Luxury Collection 50ml NEW! 72.00 FV/39526/2012/NL     Slaats Tim 6.00 2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00 FV/39527/2012/NL     Nabben Britt 44.95 3 3 x E013 - Krachtreiniger 1000ml 24.75 4 2 x E016 -Tapijtreiniger 1000ml 9.20 5 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39528/2012/NL     Nabben Lieke 32.00 6 1 x 192 - Luxury Collection 50ml 21.00 7 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39529/2012/NL     Claessens Patrick 12.40 8 1 x P101 - Peeling VERBENA 12.40 FV/39530/2012/NL     Smits Yolanda 56.00 9 1 x E006 - Wasmiddel VIVID COLOURS 1000ml 7.00 10 2 x B023 - Body Lotion 200ml NEW 18.40 11 2 x 023 - Classic Collection 30ml 30.60 FV/39531/2012/NL     van Pol-Thijssen Silvia 34.70 12 1 x 110 - Classic Collection 50ml 15.30 13 1 x N003 - Nagellak HOT RED 7.00 14 1 x P103 - Peeling CHERRY BLOSSOM 12.40 Aantal: 21 Totaal: 258.05 € 1.17.4564.29482 1/1        ";
    String patternString = "\\d\\s\\d\\sx.*?\\d\\.\\d\\d";
    Matcher matcher = Pattern.compile(patternString).matcher(pdf);
    List<String> dataRows = new ArrayList<String>();
    while (matcher.find()) {
        dataRows.add(matcher.group());
    }
    System.out.println(dataRows);

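Explanation of the regex:

\\d\\s\\d\\sx: matches a digit, whitespace, a digit, whitespace, then 'x'. Each \\d matches exactly one digit, so use \\d+ if line numbers or quantities can have two or more digits.
.*?: matches any number of any characters, but non-greedily, so the match ends at the first price that follows instead of running on to the last price in the string.
\\d\\.\\d\\d: matches the closing number: a digit, a dot, and two decimals.

This may need adjusting depending on how your data varies, but it should be a good starting point.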
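
To see why the non-greedy quantifier matters, here is a minimal sketch that runs the same pattern with `.*?` and with `.*` over a shortened piece of the invoice text. The class name `GreedyDemo` and the helper `findAll` are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {

    /** Returns every substring of text that the regex matches, in order. */
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Shortened excerpt of the invoice string from the question.
        String text = "2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00 "
                + "FV/39527/2012/NL Nabben Britt 44.95 "
                + "3 3 x E013 - Krachtreiniger 1000ml 24.75";

        // Non-greedy: each match stops at the first price after "n n x".
        System.out.println(findAll("\\d\\s\\d\\sx.*?\\d\\.\\d\\d", text));

        // Greedy: the first match swallows everything up to the last price.
        System.out.println(findAll("\\d\\s\\d\\sx.*\\d\\.\\d\\d", text));
    }
}
```

With the non-greedy version you get two separate order lines; with the greedy version you get a single match that spans both of them plus the customer header in between.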

If you need a list of a custom data structure instead of Strings, you can get at the individual parts of each match like this:

...  
String patternString = "(\\d)\\s(\\d)\\sx.*?\\d\\.\\d\\d";
...
while (matcher.find()) {
    MyDataObj m = new MyDataObj();
    m.setSomeField(matcher.group(1));     // first captured digit (line number)
    m.setAnotherField(matcher.group(2));  // second captured digit (quantity)
}

Just wrap every value you want to keep in parentheses in the pattern and retrieve it with matcher.group(1), matcher.group(2), and so on. (matcher.group(0) gives you the whole match.)
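
As a fuller sketch of the capture-group idea, the following parses the order lines into small value objects. It assumes the text uses ordinary spaces, as in the string pasted in the question. The names `OrderLineParser` and `OrderLine` and the field names are hypothetical. The pattern is my own variation: it uses `\\d+` so two-digit line numbers survive, and it treats the "CODE -" prefix as optional because lines such as "5 1 x 3 Step Mascara PERFECT BLACK 11.00" have no product code.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderLineParser {

    /** Hypothetical value class; field names are illustrative only. */
    static final class OrderLine {
        final int lineNumber;
        final int quantity;
        final String productCode;  // null when the line has no "CODE -" prefix
        final String description;
        final BigDecimal price;

        OrderLine(int lineNumber, int quantity, String productCode,
                  String description, BigDecimal price) {
            this.lineNumber = lineNumber;
            this.quantity = quantity;
            this.productCode = productCode;
            this.description = description;
            this.price = price;
        }
    }

    // Groups: 1 line number, 2 quantity, 3 optional product code,
    // 4 description, 5 price with two decimals.
    private static final Pattern ORDER_LINE = Pattern.compile(
            "(\\d+)\\s(\\d+)\\sx\\s(?:(\\w+)\\s?-\\s?)?(.*?)\\s(\\d+\\.\\d{2})");

    /** Finds every order line in the text and converts it to an OrderLine. */
    static List<OrderLine> parse(String text) {
        List<OrderLine> lines = new ArrayList<>();
        Matcher m = ORDER_LINE.matcher(text);
        while (m.find()) {
            lines.add(new OrderLine(
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2)),
                    m.group(3),  // null if the optional group did not match
                    m.group(4),
                    new BigDecimal(m.group(5))));
        }
        return lines;
    }

    public static void main(String[] args) {
        String text = "FV/39530/2012/NL     Smits Yolanda 56.00 "
                + "9 1 x E006 - Wasmiddel VIVID COLOURS 1000ml 7.00 "
                + "10 2 x B023 - Body Lotion 200ml NEW 18.40 "
                + "5 1 x 3 Step Mascara PERFECT BLACK 11.00";
        for (OrderLine line : parse(text)) {
            System.out.println(line.lineNumber + " | " + line.quantity + " | "
                    + line.productCode + " | " + line.description + " | " + line.price);
        }
    }
}
```

Prices are kept as BigDecimal rather than double so that amounts like 18.40 are stored exactly, which is usually what you want before writing them to a database.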

Answer 2

The answer worked great. The resulting code is as follows:

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package part4.chapter15;

import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/big.pdf" ;
    /** The resulting text file. */
    public static final String RESULT = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/sample-result.txt" ;

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {

        /** Putting result in Array, to be able extract to Table */
        PdfArray array;

        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            String str = strategy.getResultantText();
            CharSequence FindPage = "Lp. Rekening Totaal"; 
            if  (str.contains(FindPage)){ 
                /* Pattern re = Pattern.compile("(\\d+)\\s(\\d+)(\\xA0)x(\\xA0)(.*?)(\\d+\\.\\d{2})"); */
                /* Pattern for order lines of articles with a product code */
                Pattern re2 = Pattern.compile("(\\d+)\\s(\\d+)(\\xA0)x(\\xA0)(\\w+)(\\xA0)-\\s(.*?)(\\d+\\.\\d{2})");
                Matcher m = re2.matcher(str);
                int mIdx = 0;
                while (m.find()) {
                    for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
                        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
                    }
                    mIdx++;
                }

                out.println(strategy.getResultantText());
            }
        }
        out.flush();
        out.close();
    }


    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }

}

The output looks like this:

Complete order line [0][0] = 4 3 x 023 - Classic Collection 30ml 45.90
Line number [0][1] = 4
Quantity [0][2] = 3
Empty [0][3] =
Empty [0][4] =
Product code [0][5] = 023
Empty [0][6] =
Product description [0][7] = Classic Collection 30ml
Price [0][8] = 45.90

[1][0] = 5 2 x C052 - Hand and Nail Cream 100ml NEW 15.20
[1][1] = 5
[1][2] = 2
[1][3] =
[1][4] =
[1][5] = C052
[1][6] =
[1][7] = Hand and Nail Cream 100ml NEW
[1][8] = 15.20

Thanks for the great support.
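
A note on the "empty" groups 3, 4 and 6: they are not really empty. Each captures the single non-breaking space (\\xA0) that the parenthesised parts of the pattern match. One possible cleanup, sketched below, is to replace the non-breaking spaces with ordinary spaces first, so that plain \\s works and only the five useful values are captured. The class `OrderLineStore`, the table `order_lines` and its column names are assumptions for illustration, not part of OpenBravoPOS. The JDBC part uses only standard `java.sql` calls but has not been run against a real schema.

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderLineStore {

    // Groups: 1 line number, 2 quantity, 3 product code, 4 description, 5 price.
    private static final Pattern ORDER_LINE = Pattern.compile(
            "(\\d+)\\s(\\d+)\\sx\\s(\\w+)\\s-\\s(.*?)\\s(\\d+\\.\\d{2})");

    /** Normalizes non-breaking spaces, then returns one String[5] per order line. */
    static List<String[]> extractRows(String pageText) {
        String normalized = pageText.replace('\u00A0', ' ');
        List<String[]> rows = new ArrayList<>();
        Matcher m = ORDER_LINE.matcher(normalized);
        while (m.find()) {
            rows.add(new String[] {
                    m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)});
        }
        return rows;
    }

    /** Inserts the rows in one batch. Table and column names are hypothetical. */
    static void insertRows(Connection con, List<String[]> rows) throws SQLException {
        String sql = "INSERT INTO order_lines "
                + "(line_no, quantity, product_code, description, price) "
                + "VALUES (?, ?, ?, ?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            for (String[] row : rows) {
                ps.setInt(1, Integer.parseInt(row[0]));
                ps.setInt(2, Integer.parseInt(row[1]));
                ps.setString(3, row[2]);
                ps.setString(4, row[3]);
                ps.setBigDecimal(5, new BigDecimal(row[4]));
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    public static void main(String[] args) {
        // Same shape as the extracted text: non-breaking spaces around "x" and after the code.
        String page = "4 3\u00A0x\u00A0023\u00A0- Classic Collection 30ml 45.90";
        for (String[] row : extractRows(page)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Inside parsePdf you would call extractRows(str) for each page that contains "Lp. Rekening Totaal" and pass the result to insertRows together with whatever Connection your application already has. Like the pattern it is based on, this version only picks up lines that have a product code.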

How do I convert a long string into an array or database fields?