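How to write a Java regular expression for this txt file

Time: 2013-11-28 10:05:14

Tags: java regex

Thanks for your help.

I want to get the ID and category (the "group" field) of each item in a txt file that looks like this: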

Id:   0
ASIN: 0771044445
discontinued product

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group: Book
  salesrank: 396585
  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
  categories: 2
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Preaching[12368]
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Christianity[12290]|Clergy[12360]|Sermons[12370]
  reviews: total: 2  downloaded: 2  avg rating: 5
    2000-7-28  cutomer: A2JW67OY8U6HHK  rating: 5  votes:  10  helpful:   9
    2003-12-14  cutomer: A2VE83MZF98ITY  rating: 5  votes:   6  helpful:   5

Id:   2
ASIN: 0738700797
  title: Candlemas: Feast of Flames
  group: Book
  salesrank: 168596
  similar: 5  0738700827  1567184960  1567182836  0738700525  0738700940
  categories: 2
   |Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
......

The result should be organized as follows:

1 Book

2 Book

3 Book

I then wrote a Java program to extract the information:

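import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Main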
{

  public static void main(String[] args) throws IOException
  { 

    String file="/Users/swing/Desktop/test.rtf";  

      BufferedReader br;

      try 
      {
          br = new BufferedReader(new FileReader(file));

          String line;      

          String re1=".*?"; // Non-greedy match on filler
          String re2="";    // ID 1

          String re3="((?:[c-z][a-z]+))";   // Category 1

          Pattern p = Pattern.compile(re1+re2+re3,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
          Matcher m = p.matcher(file);

          while((line=br.readLine())!=null)
          {
            m=p.matcher(line);

              if (m.find())
              {
                  String id1=m.group(1);
                  String category1=m.group(2);
                  System.out.print(" "+id1.toString()+" "+" "+category1.toString()+" "+"\n");
              }    
          } 
      }  
      catch (FileNotFoundException e)    
      {         
          e.printStackTrace();  
          System.out.println("fail");}   
      }
}

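Since I have no experience with Java regular expressions, the program produces the wrong output shown below. Could you help me correct the code? Thanks!

Incorrect output: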

\r  tf 

\font  tbl 

color  tbl 

ar  gl 

ardir  natural 

ardir  natural 

AS  IN 

dis  continued 

AS  IN 

tit  le 

gro  up 

ales  rank 

simi  lar 

....

1 answer:

Answer 0: (score: 0)

Try this regular expression with the options you already chose (that is, DOTALL and case-insensitive):

Pattern

Id:\s+?(\d).+?(?:group:|discontinued)\s(\w+?)\s

Input

The .txt file you provided in the question

Output

Matches

1. Group 1: 0
   Group 2: product

2. Group 1: 1
   Group 2: Book

3. Group 1: 2
   Group 2: Book
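
The pattern above spans several lines of each record, so it cannot match when the file is fed to the matcher one line at a time, as the loop in the question does. A minimal sketch of how it might be applied to the whole text in Java follows. The class name `IdGroupExtractor` and the method `extract` are hypothetical names chosen for illustration, and `(\d)` is widened to `(\d+)` on the assumption that the real file contains ids with more than one digit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IdGroupExtractor {

    // The answer's pattern with (\d) widened to (\d+) (an assumption, see above).
    // DOTALL lets ".+?" cross line breaks, so one match can cover a whole record.
    private static final Pattern RECORD = Pattern.compile(
            "Id:\\s+?(\\d+).+?(?:group:|discontinued)\\s(\\w+?)\\s",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns one "id word" string per record found in the text.
    static List<String> extract(String text) {
        List<String> rows = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            rows.add(m.group(1) + " " + m.group(2));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Shortened copy of the sample records from the question.
        String sample =
                "Id:   0\nASIN: 0771044445\n  discontinued product\n\n"
              + "Id:   1\nASIN: 0827229534\n"
              + "  title: Patterns of Preaching: A Sermon Sampler\n"
              + "  group: Book\n  salesrank: 396585\n\n"
              + "Id:   2\nASIN: 0738700797\n"
              + "  title: Candlemas: Feast of Flames\n"
              + "  group: Book\n  salesrank: 168596\n";

        // Prints "0 product", "1 Book" and "2 Book", one per line.
        for (String row : extract(sample)) {
            System.out.println(row);
        }
    }
}
```

For a real file, the text could be loaded in one go with something like `new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8)` and passed to `extract`. The incorrect output in the question contains RTF control words such as `\fonttbl`, and the path ends in `.rtf`, so the file probably needs to be saved as plain text first. To skip discontinued items, as the desired output suggests, one option is to drop the `|discontinued` alternative from the pattern.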