Question

I have a file in the following format.

.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...

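".I 1" marks the start of the text block for doc ID 1, and ".I 2" marks the start of the text block for doc ID 2.

What I need is to read the text between ".I 1" and ".I 2" and save it as a separate file such as "DOC_ID_1.txt", then read the text between ".I 2" and ".I 3" and save it as a separate file such as "DOC_ID_2.txt", and so on. Assume the number of .I # markers is unknown.

I tried the following but could not finish it. Any help would be appreciated.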

String inputDocFile="C:\\Dropbox\\Data\\cran.all.1400";     
try {
     File inputFile = new File(inputDocFile);
     FileReader fileReader = new FileReader(inputFile);
     BufferedReader bufferedReader = new BufferedReader(fileReader);
     String line=null;
     String outputDocFileSeperatedByID="DOC_ID_";
     //Pattern docHeaderPattern = Pattern.compile(".I ", Pattern.MULTILINE | Pattern.COMMENTS);
     ArrayList<ArrayList<String>> result = new ArrayList<> ();
     int docID =0;
     try {
          StringBuilder sb = new StringBuilder();
          line = bufferedReader.readLine();
          while (line != null) {
              if (line.startsWith(".I"))
              { 
                 result.add(new ArrayList<String>());
                 result.get(docID).add(".I");
                 line = bufferedReader.readLine();

                 while(line != null && !line.startsWith(".I")){
                    line = bufferedReader.readLine();
                    }
                     ++docID;
              }        
              else line = bufferedReader.readLine();
          }

      } finally {
          bufferedReader.close();
      }
   } catch (IOException ex) {
      Logger.getLogger(ReadFile.class.getName()).log(Level.SEVERE, null, ex);
   }

Answer 1

Look into regular expressions; Java provides a built-in library for them.

https://docs.oracle.com/javase/tutorial/essential/regex/

http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

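These links will give you a starting point. In practice, you can pattern-match each line while keeping a counter, and store everything between the first match and the second match. That information can then be written to a separate file using the Formatter class.

It is documented here: http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html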
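A minimal sketch of that idea, assuming the header lines look exactly like ".I 12" and that a running counter is good enough for naming the output files. The class name FormatterSplit, the method split, and the outDir parameter are made up for illustration; they are not part of the question's code.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.util.Formatter;
import java.util.regex.Pattern;

public class FormatterSplit {
    // Header lines such as ".I 12"; the dot is escaped so it only matches a literal dot.
    private static final Pattern HEADER = Pattern.compile("^\\.I \\d+$");

    /** Writes each block to outDir/DOC_ID_<counter>.txt and returns the number of blocks. */
    public static int split(BufferedReader in, File outDir) throws IOException {
        int counter = 0;
        Formatter out = null;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                if (HEADER.matcher(line).matches()) {
                    // A header closes the previous file and opens the next one.
                    if (out != null) {
                        out.close();
                    }
                    counter++;
                    out = new Formatter(new File(outDir, "DOC_ID_" + counter + ".txt"), "UTF-8");
                } else if (out != null) {
                    // Lines before the first header are ignored.
                    out.format("%s%n", line);
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return counter;
    }
}
```

It could be called as FormatterSplit.split(new BufferedReader(new FileReader(inputDocFile)), new File("C:\\Dropbox\\Data")). Note that Formatter does not throw on write errors; they have to be checked with ioException() if that matters.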

Answer 2

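You want to find the lines that match ".I <number>".

The regular expression you need is: ^.I \d$

^ marks the beginning of the line. So if there is whitespace or text before the .I, the line will not match the regular expression.
\d stands for any digit. To keep things simple, I only allow a single digit in this regular expression.
$ marks the end of the line. So if there are characters after the digit, the line will not match the expression.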
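To see which lines this expression accepts, here is a small sketch. ONE_DIGIT is the regular expression from this answer; MANY_DIGITS is a hypothetical stricter variant (escaped dot, one or more digits) that is not part of the original answer and is shown only for comparison. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

public class HeaderRegexDemo {
    // The expression from the answer: unescaped dot, exactly one digit.
    public static final Pattern ONE_DIGIT = Pattern.compile("^.I \\d$");
    // Hypothetical variant: literal dot, one or more digits.
    public static final Pattern MANY_DIGITS = Pattern.compile("^\\.I \\d+$");

    public static boolean isHeader(Pattern pattern, String line) {
        // matches() requires the whole line to match the pattern.
        return pattern.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHeader(ONE_DIGIT, ".I 1"));     // true
        System.out.println(isHeader(ONE_DIGIT, " .I 1"));    // false: leading space
        System.out.println(isHeader(ONE_DIGIT, ".I 1x"));    // false: trailing character
        System.out.println(isHeader(ONE_DIGIT, ".I 12"));    // false: two digits
        System.out.println(isHeader(ONE_DIGIT, "xI 1"));     // true: unescaped dot matches any character
        System.out.println(isHeader(MANY_DIGITS, ".I 12"));  // true
        System.out.println(isHeader(MANY_DIGITS, "xI 1"));   // false
    }
}
```

If the input has more than nine documents, the single-digit version will stop recognising headers from ".I 10" onwards, so something like the second pattern is probably what you want in practice.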

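Now you need to read the file line by line and keep a reference to the file you are currently writing lines to.

Reading a file line by line is much easier in Java 8 with Files.lines():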

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

private String currentFile = "root.txt";

public static final String REGEX = "^.I \\d$";

public void foo() throws Exception {

  Path path = Paths.get("path/to/your/input/file.txt");
  // Files.lines() keeps the file open, so close the stream when done
  try (Stream<String> lines = Files.lines(path)) {
    lines.forEach(line -> {
      if (line.matches(REGEX)) {
        // Extract the digit and update currentFile
        currentFile = "DOC_ID_" + line.substring(3) + ".txt";
        System.out.println("Current file is now : " + currentFile);
      } else {
        System.out.println("Writing this line to " + currentFile + " :" + line);
        //Files.write(...);
      }
    });
  }
}

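Note: to extract the digit I used a raw "".substring(), which I consider evil but easier to understand. You can do it better with Pattern and Matcher.

Use this regular expression: ".I (\\d)" (the same as before, but with parentheses marking what you want to capture). Then: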

Pattern pattern = Pattern.compile(".I (\\d)");
Matcher matcher = pattern.matcher(".I 3");
if(matcher.find()) {
  System.out.println(matcher.group(1));//display "3"
}
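Putting the pieces together, here is one possible sketch that uses the captured ID to name the output files. It assumes the input is small enough to read into memory, that it is valid UTF-8, and that IDs may have more than one digit (hence \\d+ and an escaped dot, which differ from the expression above). DocSplitter, split, and outDir are names invented for this example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DocSplitter {
    // Group 1 captures the document ID from a header line such as ".I 42".
    private static final Pattern HEADER = Pattern.compile("^\\.I (\\d+)$");

    /** Writes every block to outDir/DOC_ID_<id>.txt and returns how many files were started. */
    public static int split(Path input, Path outDir) throws IOException {
        int written = 0;
        BufferedWriter out = null;
        try {
            for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
                Matcher m = HEADER.matcher(line);
                if (m.matches()) {
                    // A new header: close the current file and switch to the next one.
                    if (out != null) {
                        out.close();
                    }
                    Path target = outDir.resolve("DOC_ID_" + m.group(1) + ".txt");
                    out = Files.newBufferedWriter(target, StandardCharsets.UTF_8);
                    written++;
                } else if (out != null) {
                    // Lines before the first header are skipped.
                    out.write(line);
                    out.newLine();
                }
            }
        } finally {
            if (out != null) {
                out.close();
            }
        }
        return written;
    }
}
```

Keeping a single open writer avoids reopening the target file for every line, which is what a Files.write(...) call per line would do.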

Answer 3

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class Test {

    /**
     * @param args
     * @throws IOException 
     */
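    public static void main(String[] args) throws IOException {
        String inputFile="C:\\logs\\test.txt";
        BufferedReader br = new BufferedReader(new FileReader(new File(inputFile)));
        String line=null;
        StringBuilder sb = new StringBuilder();
        int count=1;
        try {
            while((line = br.readLine()) != null){
                if(line.startsWith(".I")){
                    // A new header closes the previous block, if there is one
                    if(sb.length()!=0){
                        writeDoc(count, sb);
                        count++;
                    }
                    continue;
                }
                // readLine() strips the line terminator, so add it back
                sb.append(line).append(System.lineSeparator());
            }
            // The last block has no header after it, so write it here
            if(sb.length()!=0){
                writeDoc(count, sb);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            br.close();
        }
    }

    // Writes the buffered block to DOC_ID_<count>.txt and empties the buffer
    private static void writeDoc(int count, StringBuilder sb) throws IOException {
        File file = new File("C:\\logs\\DOC_ID_"+count+".txt");
        PrintWriter writer = new PrintWriter(file, "UTF-8");
        writer.print(sb.toString());
        writer.close();
        sb.setLength(0);
    }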

}
