How do I parse text using delimiters?

Time: 2009-11-20 10:51:23

Tags: java

  

Possible duplicate:
  How to parse this output and separate each field/word

I want to parse the following data so that I get the output specified below.

Input:

RTRV-ALM-EQPT::ALL:RA01;

   SIMULATOR 09-11-20 13:52:15
M  RA01 COMPLD
   "SLOT-1-1-1,CMP:MN,T-FANCURRENT-1-HIGH,NSA,01-10-09,00-00-00,,:\"Fan-T\","
   "SLOT-1-1-1,CMP:MJ,T-BATTERYPWR-2-LOW,NSA,01-10-09,00-00-00,,:\"Battery-T\","
   "SLOT-1-1-2,CMP:CR,PROC_FAIL,SA,09-11-20,13-51-55,,:\"Processor Failure\","
   "SLOT-1-1-3,OLC:MN,T-LASERCURR-1-HIGH,SA, 01-10-07,13-21-03,,:\"Laser-T\","
   "SLOT-1-1-3,OLC:MJ,T-LASERCURR-2-LOW,NSA, 01-10-02,21-32-11,,:\" Laser-T\","
   "SLOT-1-1-4,OLC:MN,T-LASERCURR-1-HIGH,SA,01-10-05,02-14-03,,:\"Laser-T\","
   "SLOT-1-1-4,OLC:MJ,T-LASERCURR-2-LOW,NSA,01-10-04,01-03-02,,:\"Laser-T\","
;

Output:

1) RTRV-ALM-EQPT::ALL:RA01;
2) SIMULATOR 
3) 09-11-20 
4) 13:52:15
5) M  
6) RA01 
7) COMPLD
8) "SLOT-1-1-1,CMP:MN,T-FANCURRENT-1-HIGH,NSA,01-10-09,00-00-00,,:\"Fan-T\","
9) "SLOT-1-1-1,CMP:MJ,T-BATTERYPWR-2-LOW,NSA,01-10-09,00-00-00,,:\"Battery-T\","
10) "SLOT-1-1-2,CMP:CR,PROC_FAIL,SA,09-11-20,13-51-55,,:\"Processor Failure\","
11) "SLOT-1-1-3,OLC:MN,T-LASERCURR-1-HIGH,SA, 01-10-07,13-21-03,,:\"Laser-T\","
12) "SLOT-1-1-3,OLC:MJ,T-LASERCURR-2-LOW,NSA, 01-10-02,21-32-11,,:\" Laser-T\","
13) "SLOT-1-1-4,OLC:MN,T-LASERCURR-1-HIGH,SA,01-10-05,02-14-03,,:\"Laser-T\","
14) "SLOT-1-1-4,OLC:MJ,T-LASERCURR-2-LOW,NSA,01-10-04,01-03-02,,:\"Laser-T\","

3 answers:

Answer 0 (score: 1)

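The best approach is probably not to think of this as converting the first text into the second.

Instead, first think about parsing the first text into a set of Java objects that represent what the data actually is. For example, the second/third line of the input might be represented by a Test class with "area", "day" and "time" properties. (Only you can come up with a sensible model, based on your knowledge of what everything means.)

Then, once you have a good in-memory representation of the information in the file, you can think about printing it as text, as in the second listing. At that point it should be easy to print the various fields and properties from the Java objects, rather than trying to transform the input text on the fly.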
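As a rough illustration of that idea, here is a minimal sketch that models only the "SIMULATOR 09-11-20 13:52:15" line. The class name Header, the parse method and the regular expression are hypothetical, and the field names simply reuse the "area", "day" and "time" example above; the real meaning of each field is something only the owner of the format can confirm.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of the line "   SIMULATOR 09-11-20 13:52:15".
// Field names are guesses taken from the example in the answer.
final class Header {
    // Assumed layout: a name, a yy-mm-dd date, and an hh:mm:ss time,
    // separated by whitespace.
    private static final Pattern LINE = Pattern.compile(
        "\\s*(\\S+)\\s+(\\d{2}-\\d{2}-\\d{2})\\s+(\\d{2}:\\d{2}:\\d{2})\\s*");

    final String area;
    final String day;
    final String time;

    private Header(String area, String day, String time) {
        this.area = area;
        this.day = day;
        this.time = time;
    }

    // Parses one header line, or fails loudly if the line has another shape.
    static Header parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a header line: " + line);
        }
        return new Header(m.group(1), m.group(2), m.group(3));
    }
}

class HeaderDemo {
    public static void main(String[] args) {
        Header h = Header.parse("   SIMULATOR 09-11-20 13:52:15");
        // Printing is now a separate step that works on the fields.
        System.out.println("area: " + h.area);
        System.out.println("day:  " + h.day);
        System.out.println("time: " + h.time);
    }
}
```

Once each kind of line has a class like this, producing the numbered listing is just a matter of walking over the objects and printing their fields.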

Answer 1 (score: 1)

Assuming the file is relatively small, so that it can be read into memory, try something like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String text = "RTRV-ALM-EQPT::ALL:RA01;\n"+
            "\n"+
            "   SIMULATOR 09-11-20 13:52:15\n"+
            "M  RA01 COMPLD\n"+
            "   \"SLOT-1-1-1,CMP:MN,T-FANCURRENT-1-HIGH,NSA,01-10-09,00-00-00,,:\\\"Fan-T\\\",\"\n"+
            "   \"SLOT-1-1-1,CMP:MJ,T-BATTERYPWR-2-LOW,NSA,01-10-09,00-00-00,,:\\\"Battery-T\\\",\"\n"+
            "   \"SLOT-1-1-2,CMP:CR,PROC_FAIL,SA,09-11-20,13-51-55,,:\\\"Processor Failure\\\",\"\n"+
            "   \"SLOT-1-1-3,OLC:MN,T-LASERCURR-1-HIGH,SA, 01-10-07,13-21-03,,:\\\"Laser-T\\\",\"\n"+
            "   \"SLOT-1-1-3,OLC:MJ,T-LASERCURR-2-LOW,NSA, 01-10-02,21-32-11,,:\\\" Laser-T\\\",\"\n"+
            "   \"SLOT-1-1-4,OLC:MN,T-LASERCURR-1-HIGH,SA,01-10-05,02-14-03,,:\\\"Laser-T\\\",\"\n"+
            "   \"SLOT-1-1-4,OLC:MJ,T-LASERCURR-2-LOW,NSA,01-10-04,01-03-02,,:\\\"Laser-T\\\",\"\n"+
            ";";
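        // Match either a double-quoted string (honoring backslash escapes)
        // or a run of non-whitespace characters.
        Matcher m = Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"|\\S+").matcher(text);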
        int n = 0;
        while(m.find()) {
            System.out.println((++n)+") "+m.group());
        }
    }
}

Output:

1) RTRV-ALM-EQPT::ALL:RA01;
2) SIMULATOR
3) 09-11-20
4) 13:52:15
5) M
6) RA01
7) COMPLD
8) "SLOT-1-1-1,CMP:MN,T-FANCURRENT-1-HIGH,NSA,01-10-09,00-00-00,,:\"Fan-T\","
9) "SLOT-1-1-1,CMP:MJ,T-BATTERYPWR-2-LOW,NSA,01-10-09,00-00-00,,:\"Battery-T\","
10) "SLOT-1-1-2,CMP:CR,PROC_FAIL,SA,09-11-20,13-51-55,,:\"Processor Failure\","
11) "SLOT-1-1-3,OLC:MN,T-LASERCURR-1-HIGH,SA, 01-10-07,13-21-03,,:\"Laser-T\","
12) "SLOT-1-1-3,OLC:MJ,T-LASERCURR-2-LOW,NSA, 01-10-02,21-32-11,,:\" Laser-T\","
13) "SLOT-1-1-4,OLC:MN,T-LASERCURR-1-HIGH,SA,01-10-05,02-14-03,,:\"Laser-T\","
14) "SLOT-1-1-4,OLC:MJ,T-LASERCURR-2-LOW,NSA,01-10-04,01-03-02,,:\"Laser-T\","
15) ;

The only difference is the 15th match, the lone ";", which I believe you left out.

The raw regex (without all the escaping) looks like this:

"(?:\\.|[^\\"])*"|\S+

and it matches:

"          # match a double quote
(?:        # open non-capturing group 1
  \\.      #   match a backslash followed by any char (except line breaks)
  |        #   OR
  [^\\"]   #   match any char except a backslash and a double quote
)*         # close non-capturing group 1 and repeat it zero or more times
"          # match a double quote
|          # OR
\S+        # match one or more characters other than white space chars

In other words: match a quoted string, or match a word consisting only of non-whitespace characters.
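To see why the quoted-string alternative matters, the following sketch compares the pattern with a plain whitespace split on a shortened line that contains a space inside the quotes. The class TokenizerDemo, the helper tokenize and the shortened sample line are made up for this illustration; the pattern is the one explained above, written as a Java string literal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizerDemo {
    // Raw form: "(?:\\.|[^\\"])*"|\S+
    private static final Pattern TOKEN =
        Pattern.compile("\"(?:\\\\.|[^\\\\\"])*\"|\\S+");

    // Hypothetical helper: collects every match of TOKEN, in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Shortened, made-up line: M  RA01 "CMP:CR,:\"Processor Failure\","
        String line = "M  RA01 \"CMP:CR,:\\\"Processor Failure\\\",\"";

        // The regex keeps the quoted record together: 3 tokens.
        System.out.println(tokenize(line).size() + " " + tokenize(line));

        // A plain whitespace split cuts it at the space: 4 tokens.
        List<String> naive = Arrays.asList(line.trim().split("\\s+"));
        System.out.println(naive.size() + " " + naive);
    }
}
```

Because the quoted alternative is listed first, the matcher tries it whenever a token starts with a double quote and falls back to \S+ otherwise.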

Answer 2 (score: 0)

To parse any input, you have to know its structure.

  1. Are the first four lines always present?
  2. What is the format of each of those four lines?
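Whatever the answers turn out to be, they can be written down as explicit checks so that unexpected input is rejected before any parsing starts. The sketch below assumes the four leading lines always look like the single sample in the question (command echo, blank line, header line, status line). The class StructureCheck, the method hasExpectedPreamble and all three patterns are guesses made for illustration, not a description of the real format.

```java
import java.util.regex.Pattern;

class StructureCheck {
    // Assumed shapes, inferred only from the one sample in the question.
    private static final Pattern COMMAND = Pattern.compile("\\S+;");
    private static final Pattern HEADER = Pattern.compile(
        "\\s+\\S+ \\d{2}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}");
    private static final Pattern STATUS = Pattern.compile("M\\s+\\S+\\s+\\S+");

    // True only if the first four lines have the assumed shapes.
    // Lines are assumed to be separated by "\n" (no "\r\n" handling).
    static boolean hasExpectedPreamble(String text) {
        String[] lines = text.split("\n", -1);
        return lines.length >= 4
            && COMMAND.matcher(lines[0]).matches()
            && lines[1].trim().isEmpty()
            && HEADER.matcher(lines[2]).matches()
            && STATUS.matcher(lines[3]).matches();
    }

    public static void main(String[] args) {
        String sample = "RTRV-ALM-EQPT::ALL:RA01;\n"
            + "\n"
            + "   SIMULATOR 09-11-20 13:52:15\n"
            + "M  RA01 COMPLD\n"
            + ";";
        System.out.println(hasExpectedPreamble(sample));
        System.out.println(hasExpectedPreamble("M  RA01 COMPLD\n;"));
    }
}
```

If some of the four lines turn out to be optional, the corresponding checks would need to be relaxed accordingly.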