Question

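I'm trying to figure out how to use a regular expression to condense and sort the information I get from this code. Here is the code; I'll explain it afterwards: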

import java.io.*;
import java.util.*;

public class baseline 
{

// Class level variables
   static Scanner sc = new Scanner(System.in); 

   public static void main(String[] args) throws IOException, 
   FileNotFoundException { // Start of main

   // Variables  
   String filename;

   // Connecting to the output file with a buffer
   PrintWriter outFile = new PrintWriter(
                          new BufferedWriter(
                           new FileWriter("chatOutput.log")));

   // Get the input file
   System.out.print("Please enter full name of the file: ");
   filename = sc.next();

   // Assign the name of the input file to a file object
   File log = new File(filename);
   String textLine = null; // Current line read from the input file
   String outLine = "";    // Empty string, not null
   BufferedWriter bw = null;



  try
  {
  // assigns the input file to a filereader object
     BufferedReader infile = new BufferedReader(new FileReader(log));

      sc = new Scanner(log);
            while(sc.hasNext())
            {
                String line=sc.nextLine();
                if(line.contains("LANTALK"))
                    System.out.println(line);
            } // End of while


  try
   {
     // Read data from the input file
    while((textLine = infile.readLine()) != null)
     {
    // Print to output file
    outLine = textLine;
    sc = new Scanner (outLine);
          while(sc.hasNext())
          {
               String line=sc.nextLine();
               if(line.contains("LANTALK"))
                    outFile.printf("%s\n",outLine);
          }// end of while 
      } // end of while
    } // end of try


   finally  // This gets executed even when an exception is thrown 
      {
    infile.close();
    outFile.close();
      } // End of finally
    } // End of try


  catch (FileNotFoundException nf) // Goes with first try
  {
   System.out.println("The file \""+log+"\" was not found"); 
  } // End of catch
  catch (IOException ioex) // Goes with second try
  {
   System.out.println("Error reading the file");
  } // End of catch

 } // end of main

} // end of class

So I'm reading an input file, keeping only the lines that contain "LANTALK", and printing them to another file. Here is a sample of the current output:

14:29:39.731 [D] [T:000FEC] [F:LANTALK2C] <CMD>LANMSG</CMD>
<MBXID>922</MBXID><MBXTO>5608</MBXTO><SUBTEXT>LanTalk</SUBTEXT><MOBILEADDR>
</MOBILEADDR><LAP>0</LAP><SMS>0</SMS><MSGTEXT>It is mailing today right?
</MSGTEXT>
14:41:33.703 [D] [T:000FF4] [F:LANTALK2C] <CMD>LANMSG</CMD>
<MBXID>929</MBXID><MBXTO>5601</MBXTO><SUBTEXT>LanTalk</SUBTEXT><MOBILEADDR>
</MOBILEADDR><LAP>0</LAP><SMS>0</SMS><MSGTEXT>Either today or tomorrow - 
still waiting to hear. </MSGTEXT>

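What I need is to pull out everything between <MSGTEXT> and </MSGTEXT> so the message displays cleanly. How should I write that into the code so it repeats for every "LANTALK" line and still writes the output correctly? Thanks!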

Answer 1

You can find the MSGTEXT content with a regular expression:

<MSGTEXT>(.*?)</MSGTEXT>

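However, some messages contain line breaks, which makes this a bit harder: matching one line at a time will miss any message that wraps onto the next line.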
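To see why the line breaks matter, here is a small sketch comparing the pattern with and without Pattern.DOTALL on a message that wraps. The class name DotallDemo and the helper firstMessage are made up for illustration; only the pattern itself comes from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotallDemo {
    // Hypothetical helper: returns the first MSGTEXT body found, or null if nothing matches.
    static String firstMessage(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A message whose text wraps onto a second line, as in the sample log.
        String record = "<MSGTEXT>Either today or tomorrow - \nstill waiting to hear. </MSGTEXT>";

        Pattern plain = Pattern.compile("<MSGTEXT>(.*?)</MSGTEXT>");
        Pattern dotall = Pattern.compile("<MSGTEXT>(.*?)</MSGTEXT>", Pattern.DOTALL);

        // Without DOTALL, '.' does not match the line break, so there is no match.
        System.out.println(firstMessage(plain, record));
        // With DOTALL, '.' also matches line breaks and the whole message is captured.
        System.out.println(firstMessage(dotall, record));
    }
}
```

The first call finds nothing because the closing tag is only reachable across a line break; the second returns the full two-line message.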

One way around this is to read the whole file into a String and then look for matches.

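import java.nio.file.Files;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

try {
    // Read the whole log into memory; log is the File object from the question
    String text = new String(Files.readAllBytes(log.toPath()));
    // DOTALL lets '.' match line breaks, so messages that wrap are captured too
    Matcher m = Pattern.compile("<MSGTEXT>(.*?)</MSGTEXT>", Pattern.DOTALL).matcher(text);
    while (m.find()) {
        System.out.println("Message: " + m.group(1));
    }
} catch (IOException e) {
    // Handle exception
}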

Console output:

Message: It is mailing today right?

Message: Either today or tomorrow - 
still waiting to hear.

Keep in mind that this approach can use a lot of memory if you are processing large log files.
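If memory is a concern, one alternative is to buffer only the record currently being read. The sketch below is not from the original code: the class name StreamingMessages and the method extract are hypothetical, and it assumes that every record starts on a line containing "LANTALK" and ends on the line containing </MSGTEXT>, as in the sample output. It also trims surrounding whitespace from each message, which you may not want.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingMessages {
    private static final Pattern MSG =
            Pattern.compile("<MSGTEXT>(.*?)</MSGTEXT>", Pattern.DOTALL);

    // Walks the log one line at a time and collects the MSGTEXT body of each LANTALK record.
    // Assumption: a record starts on a line containing "LANTALK" and ends at "</MSGTEXT>".
    static List<String> extract(Iterator<String> lines) {
        List<String> messages = new ArrayList<>();
        StringBuilder record = null;          // null while we are outside a LANTALK record
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.contains("LANTALK")) {
                record = new StringBuilder(); // start (or restart) a record
            }
            if (record == null) {
                continue;                     // line belongs to some other record, skip it
            }
            record.append(line).append('\n');
            if (line.contains("</MSGTEXT>")) {
                Matcher m = MSG.matcher(record);
                if (m.find()) {
                    messages.add(m.group(1).trim());
                }
                record = null;                // record finished, drop the buffer
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        // In-memory sample standing in for the real log file.
        List<String> sample = Arrays.asList(
                "14:29:39.731 [D] [T:000FEC] [F:LANTALK2C] <CMD>LANMSG</CMD>",
                "<MBXID>922</MBXID><MBXTO>5608</MBXTO><MSGTEXT>It is mailing today right?",
                "</MSGTEXT>");
        for (String message : extract(sample.iterator())) {
            System.out.println("Message: " + message);
        }
    }
}
```

For a real file, new BufferedReader(new FileReader(log)).lines().iterator() feeds the same method lazily, and each message can be written with outFile.println instead of System.out.println.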

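Also note that parsing XML with regular expressions is generally considered a bad idea; it works fine for now, but if you plan to do anything more complex, you should use an XML parser, as others have suggested.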
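As a rough idea of what the parser route could look like with the JDK's built-in DOM parser: the sketch below assumes that everything from the first '<' in a record onward is well-formed XML (for example, that a literal & in a message is escaped), which the sample output does not confirm. The class name XmlMessage, the method messageText, and the wrapper element <record> are all made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XmlMessage {
    // Returns the trimmed MSGTEXT body of one complete log record, or null if there is none.
    // Assumption: the part of the record starting at the first '<' is well-formed XML.
    static String messageText(String record) {
        int start = record.indexOf('<');
        if (start < 0) {
            return null;                      // no markup at all in this record
        }
        // The payload has several top-level elements, so wrap it in a made-up root element.
        String xml = "<record>" + record.substring(start) + "</record>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("MSGTEXT");
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Record is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String record = "14:29:39.731 [D] [T:000FEC] [F:LANTALK2C] <CMD>LANMSG</CMD>\n"
                + "<MBXID>922</MBXID><MBXTO>5608</MBXTO><SUBTEXT>LanTalk</SUBTEXT><MOBILEADDR>\n"
                + "</MOBILEADDR><LAP>0</LAP><SMS>0</SMS><MSGTEXT>It is mailing today right?\n"
                + "</MSGTEXT>";
        System.out.println("Message: " + messageText(record));
    }
}
```

The benefit over the regex is that the other fields (MBXID, MBXTO and so on) become available through the same Document, at the cost of failing on any record that is not valid XML.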

Answer 2

Try it with Jsoup.

Example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;


....

while (sc.hasNext())
{
    String line = sc.nextLine();
    if (line.contains("LANTALK")) {
        Document doc = Jsoup.parse(line);
        Element msg = doc.select("MSGTEXT").first();
        // first() returns null when the line has no MSGTEXT element
        if (msg != null) {
            System.out.println(msg.text());
        }
    }
    System.out.println(line);
} // End of while
    .....

Regex to display messages from a text log file
