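Insert words directly into a db table instead of storing them in an ArrayList with Java

Date: 2018-11-27 08:20:17

Tags: java

I am building an indexing program that takes a file (PDF), extracts all the words in it and stores them in an ArrayList. I also have to tokenize the words to decide which ones get indexed and by which rules, so I keep them in the ArrayList, where I can apply a regex replacement to suit my needs.

Code: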

public void index(String path) throws Exception {
    ArrayList<String> list = new ArrayList<String>();
    PDDocument document = PDDocument.load(new File(path));

    if (!document.isEncrypted()) {
        PDFTextStripper tStripper = new PDFTextStripper();
        String pdfFileInText = tStripper.getText(document);
        String lines[] = pdfFileInText.split("\\r?\\n");
        for (String line : lines) {
            String[] words = line.split(" ");

            for (String word : words) {
                //check if one/more special characters at end of string then remove OR
                //check special characters in beginning of the string then remove

                list.add(word.replaceAll("([\\W]+$)|(^[\\W]+)", ""));

            }
        }
    }

    String[] words1 = list.toArray(new String[list.size()]);
    String sql = "insert IGNORE into  test.indextable123 values (?,?)";
    preparedStatement = con.connect().prepareStatement(sql);

    for (int i = 0; i < words1.length; i++) { // start at 0 so the first word is not skipped
        preparedStatement.setString(1, words1[i]);
        preparedStatement.setString(2, path);
        preparedStatement.addBatch();

        if ((i + 1) % 1000 == 0) { // flush after every 1000 rows added
            preparedStatement.executeBatch();
            System.out.print("Add Thousand");
        }
    }

    if (words1.length % 1000 > 0) {
        preparedStatement.executeBatch();
        System.out.print("Add Remaining");
    }

    preparedStatement.close();
    System.out.println("Completed");
}

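The problem is that if I want to index a file with more than 10 million words, storing them in an ArrayList is not resource-efficient, and it also throws an OutOfMemoryError.

At the same time, I need the words in an array so that I can apply the regex replacement shown in the code. Is there a way to insert each word into the db right after it is extracted, while still filtering it with the regex I need?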

3 Answers:

Answer 0 (score: 2)

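I believe the real problem is how you read the PDF file. You call String pdfFileInText = tStripper.getText(document);, which loads the text of the whole file into a String. You then iterate over it and insert into the database. Loading the content of a file into a String can cause memory problems, which is why we usually use streams (e.g. InputStream, OutputStream etc.). They give you a way to process the file while it is being read, instead of loading it in bulk and processing it afterwards.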
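To make the contrast concrete, here is a minimal sketch of stream-style processing that uses only the JDK. It is not PDFBox code: the class name StreamingWordReader and the forEachWord helper are hypothetical, a StringReader stands in for the real text source, and skipping tokens that become empty after cleanup is an added assumption. The cleanup regex is the one from the question.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Scanner;
import java.util.function.Consumer;

class StreamingWordReader {

    /**
     * Reads whitespace-separated tokens one at a time, strips leading and
     * trailing non-word characters, and hands each non-empty word to the sink.
     * Only the scanner's internal buffer is held in memory, not the whole text.
     */
    static long forEachWord(Reader source, Consumer<String> sink) {
        long count = 0;
        try (Scanner scanner = new Scanner(source)) {
            while (scanner.hasNext()) {
                String word = scanner.next().replaceAll("([\\W]+$)|(^[\\W]+)", "");
                if (!word.isEmpty()) { // assumption: empty tokens are not worth indexing
                    sink.accept(word);
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The sink here just prints; in the indexing program it would add a row to the batch.
        long n = forEachWord(new StringReader("Hello, world! (streams)"), System.out::println);
        System.out.println(n + " words");
    }
}
```

The sink is the place where a PreparedStatement.addBatch() call would go, so no intermediate list of words is ever built.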

If you check how PDFTextStripper works, you can see the getText method:

 public  String getText( PDDocument doc ) throws IOException
    {
        StringWriter outputStream = new StringWriter();
        writeText( doc, outputStream );
        return outputStream.toString();
    }

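It uses the writeText method, which writes to an output stream (a Writer) whose content is then collected into a String. So you have a few options:

  • To avoid the memory spike, you can write a custom PDFTextStripper and override some of its methods; for example, you can override writeText and change it to write to the database.

  • You can process the PDF page by page, which limits the load. I believe there is a processPage method that you can use with a few modifications to your code.

  • You can build a super cool solution by creating a custom OutputStream (strictly speaking a java.io.Writer, since that is what writeText accepts) that stores the content directly in the database, and passing it to PDFTextStripper's writeText method.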

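I find the last approach the most interesting (even though processing page by page is probably more robust). So I will give some sample code that you can use as a reference. It still needs some modifications before it actually works:

First create a custom writer. Something like this: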

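import java.io.IOException;

class MyDatabaseWriter extends java.io.Writer {

    // Holds only the text of the current, not yet terminated line
    private final StringBuilder lineBuilder = new StringBuilder();
    // DB stuff goes here

    @Override
    public void close() throws IOException {
        // Write whatever is left after the last line break
        if (lineBuilder.length() > 0) {
            writeLineToDatabase(lineBuilder.toString());
            lineBuilder.setLength(0);
        }
        // Close DB connection
    }

    @Override
    public void flush() throws IOException {
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        lineBuilder.append(cbuf, off, len);
        // A chunk may contain zero, one or several line breaks:
        // emit every complete line and keep the unterminated tail in the buffer
        int newline;
        while ((newline = lineBuilder.indexOf("\n")) >= 0) {
            String line = lineBuilder.substring(0, newline);
            if (line.endsWith("\r")) {
                line = line.substring(0, line.length() - 1);
            }
            writeLineToDatabase(line);
            lineBuilder.delete(0, newline + 1);
        }
    }

    private void writeLineToDatabase(String line) {
        // Process your line and add it to the database
    }

}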

Then move all the database stuff into the writer. In your main class you should have something like this:

PDDocument document = PDDocument.load(new File(path));
PDFTextStripper tStripper = new PDFTextStripper();
tStripper.writeText(document, new MyDatabaseWriter());  //Or if you create an instance in another way

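PDFTextStripper extends PDFStreamEngine (not by accident :), so it passes the stream it reads to the custom writer, which can send it straight to the database. Only the current line is kept in memory.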
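The "only the current line is kept in memory" behaviour can be tried out without PDFBox or a database. The sketch below is a hypothetical LineSinkWriter (not part of any library) that forwards each completed line to a Consumer, which stands in for the database code. As a simplification it drops every '\r' character instead of only those directly before '\n'.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.function.Consumer;

class LineSinkWriter extends Writer {
    private final StringBuilder current = new StringBuilder(); // current unfinished line only
    private final Consumer<String> sink;                       // stand-in for the DB insert

    LineSinkWriter(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        // Chunks can end in the middle of a line, so the buffer survives between calls.
        for (int i = off; i < off + len; i++) {
            char c = cbuf[i];
            if (c == '\n') {
                emit();
            } else if (c != '\r') {
                current.append(c);
            }
        }
    }

    @Override
    public void flush() throws IOException {
        // Nothing to do: complete lines are forwarded as soon as they are seen.
    }

    @Override
    public void close() throws IOException {
        // The last line may not be terminated by a line break.
        if (current.length() > 0) {
            emit();
        }
    }

    private void emit() {
        sink.accept(current.toString());
        current.setLength(0);
    }
}

class LineSinkWriterDemo {
    public static void main(String[] args) throws IOException {
        try (Writer w = new LineSinkWriter(line -> System.out.println("line: " + line))) {
            w.write("first li");                    // chunk ends mid-line
            w.write("ne\r\nsecond line\nthi");      // two line breaks in one chunk
            w.write("rd line");                     // emitted on close()
        }
    }
}
```

An instance of such a writer could be passed to writeText in place of MyDatabaseWriter, with the Consumer doing the word filtering and batching.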

Answer 1 (score: 1)

Just store the data on the fly.

    PDDocument document = PDDocument.load(new File(path));

    if (!document.isEncrypted()) {

        String sql = "insert IGNORE into  test.indextable123 values (?,?)";

        PreparedStatement preparedStatement = con.connect().prepareStatement(sql);
        try {
            int i = 0;
            PDFTextStripper tStripper = new PDFTextStripper();
            String pdfFileInText = tStripper.getText(document);
            String lines[] = pdfFileInText.split("\\r?\\n");
            for (String line : lines) {
                String[] words = line.split(" ");

                for (String word : words) {
                    // check if one or more special characters at end of string then remove OR
                    // check special characters in beginning of the string then remove

                    preparedStatement.setString(1, word.replaceAll("([\\W]+$)|(^[\\W]+)", ""));

                    preparedStatement.setString(2, path);

                    preparedStatement.addBatch();
                    ++i;
                    if (i == 1000) {
                        i = 0;
                        preparedStatement.executeBatch();

                        System.out.print("Add Thousand");
                    }
                }

            }
            if (i > 0) {
                preparedStatement.executeBatch();

                System.out.print("Add Remaining");
            }

        } finally {
            preparedStatement.close();
        }
        System.out.println("Completed");
    }
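The counting logic in the loop above (flush after every 1000 rows, then flush the remainder once at the end) can be isolated and checked on its own. The following is a rough sketch with a hypothetical Batcher class; the Consumer stands in for addBatch()/executeBatch(), and unlike JDBC it keeps the pending items in its own small list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class Batcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;          // stand-in for executeBatch()
    private final List<T> pending = new ArrayList<>();

    Batcher(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queues one item and flushes automatically once the batch is full. */
    void add(T item) {
        pending.add(item);
        if (pending.size() == batchSize) {
            flush();
        }
    }

    /** Sends whatever is queued; does nothing if the queue is empty. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy so the sink may keep the list
        pending.clear();
    }
}

class BatcherDemo {
    public static void main(String[] args) {
        Batcher<String> batcher = new Batcher<>(3, batch -> System.out.println("flush " + batch));
        for (String word : "a b c d e f g".split(" ")) {
            batcher.add(word);
        }
        batcher.flush(); // remainder: the seventh word
    }
}
```

At most batchSize items are held at any time, which is the property that keeps memory bounded no matter how many words the file contains.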

Update: getting rid of the lines array:

    private static final Pattern WORD_PATTERN = Pattern.compile("\\w+");

...

    PDDocument document = PDDocument.load(new File(path));

    if (!document.isEncrypted()) {

        String sql = "insert IGNORE into  test.indextable123 values (?,?)";

        PDFTextStripper tStripper = new PDFTextStripper();
        String pdfFileInText = tStripper.getText(document);
        PreparedStatement preparedStatement = con.connect().prepareStatement(sql);
        try {
            int i = 0;
            Matcher matcher = WORD_PATTERN.matcher(pdfFileInText);
            while (matcher.find()) {
                String word = matcher.group();
                // check if one or more special characters at end of string then remove OR
                // check special characters in beginning of the string then remove

                preparedStatement.setString(1, word.replaceAll("([\\W]+$)|(^[\\W]+)", ""));

                preparedStatement.setString(2, path);

                preparedStatement.addBatch();
                ++i;
                if (i == 1000) {
                    i = 0;
                    preparedStatement.executeBatch();

                    System.out.print("Add Thousand");
                }
            }
            if (i > 0) {
                preparedStatement.executeBatch();

                System.out.print("Add Remaining");
            }

        } finally {
            preparedStatement.close();
        }
        System.out.println("Completed");
    }
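A side note on the matcher-based loop above: \w+ only matches word characters, so every match is already free of leading and trailing punctuation, but it also splits on punctuation inside a word, which the split-and-replaceAll version does not. The small comparison below illustrates this; the class and method names are made up for the example. Also keep in mind that in Java \w covers only ASCII letters, digits and underscore unless Pattern.UNICODE_CHARACTER_CLASS is set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordPatternDemo {
    private static final Pattern WORD_PATTERN = Pattern.compile("\\w+");

    /** Words as found by repeatedly matching \w+ over the whole text. */
    static List<String> viaMatcher(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD_PATTERN.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    /** Words as found by splitting on spaces and trimming non-word characters at both ends. */
    static List<String> viaSplit(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split(" ")) {
            String word = token.replaceAll("([\\W]+$)|(^[\\W]+)", "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "don't (stop), now.";
        System.out.println(viaMatcher(text)); // [don, t, stop, now]
        System.out.println(viaSplit(text));   // [don't, stop, now]
    }
}
```

Which behaviour is right depends on what should count as a word in the index.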

Update 2: using a custom Writer, as suggested by @Veselin

    PDDocument document = PDDocument.load(new File(path));

    if (!document.isEncrypted()) {

        String sql = "insert IGNORE into  test.indextable123 values (?,?)";

        PDFTextStripper tStripper = new PDFTextStripper();
        PreparedStatement preparedStatement = con.prepareStatement(sql);
        try {
            Writer writer = new Writer() {
                final StringBuilder buf = new StringBuilder();
                int batched = 0; // rows added since the last executeBatch()

                @Override
                public void write(char[] cbuf, int off, int len)
                        throws IOException {
                    int end = off + len;
                    for (int k = off; k < end; ++k) {
                        char c = cbuf[k];
                        if (Character.isLetterOrDigit(c)) {
                            buf.append(c);
                        } else if (buf.length() > 0) {
                            processBuf();
                        }
                    }
                }

                @Override
                public void flush() throws IOException {
                }

                @Override
                public void close() throws IOException {
                    if (buf.length() > 0) {
                        processBuf();
                    }
                    if (batched > 0) {
                        try {
                            preparedStatement.executeBatch();
                            batched = 0;
                        } catch (SQLException e) {
                            throw new IOException(e);
                        }
                    }
                }

                // Writer methods may only throw IOException, so SQLException is wrapped
                private void processBuf() throws IOException {
                    String word = buf.toString();
                    buf.setLength(0);
                    try {
                        preparedStatement.setString(1, word);
                        preparedStatement.setString(2, path);
                        preparedStatement.addBatch();
                        ++batched;
                        if (batched == 1000) {
                            batched = 0;
                            preparedStatement.executeBatch();
                            System.out.print("Add Thousand");
                        }
                    } catch (SQLException e) {
                        throw new IOException(e);
                    }
                }
            };
            tStripper.writeText(document, writer);
            writer.close();
        } finally {
            preparedStatement.close();
        }
        System.out.println("Completed");
    }
}

Answer 2 (score: 0)

To reiterate, no additional array or list is needed

String sql = "insert IGNORE into  test.indextable123 values (?,?)";
preparedStatement = con.connect().prepareStatement(sql);
int i = 0;

for (String word : words) {
     word = word.replaceAll("([\\W]+$)|(^[\\W]+)", "");

    preparedStatement.setString(1, word);
    preparedStatement.setString(2, path);
    preparedStatement.addBatch();

    i++;
    if (i % 1000 == 0) {
        preparedStatement.executeBatch();
        System.out.print("Add Thousand");
    }

}

if (i % 1000 != 0) { // only flush if rows were added since the last executeBatch()
    preparedStatement.executeBatch();
    System.out.print("Add Remaining");
}

preparedStatement.close();
System.out.println("Completed");