Regex to select a specific part of RTF source

Date: 2015-08-07 11:41:15

Tags: java regex

I am trying to select all data from {\\*\listtable (inclusive) up to {\\*\listoverridetable{ (exclusive).

Here is a sample RTF:

    {\\rtf1\adeflang1\ansi\\ansicpg1\uc1\adeff3\deff0\stshfdbch3\stshfloch3\stshfhich3\stshfbi3\deflang1\deflangfe1\themelang1\themelangfe0\themelangcs0{\\fonttbl{{\\f0\fbidi \froman\\fcharset0\fprq2{\\*\panose 02020603050405020304}Times New Roman;}{\\f0\fbidi \froman\\fcharset0\fprq2{\\*\panose 02020603050405020304}Times New Roman;}
{\\f3\fbidi \fswiss\\fcharset0\fprq2{\\*\panose 020f0502020204030204}Calibri;}{\\flomajor\\f3\fbidi \froman\\fcharset0\fprq2{\\*\panose 02020603050405020304}Times New Roman;}
{\\fdbmajor\\f3\fbidi \froman\\fcharset0\fprq2{\\*\panose 02020603050405020304}Times New Roman;}{\\fhimajor\\f3\fbidi \fswiss\\fcharset0\fprq2{\\*\panose 020f0302020204030204}Calibri Light;}
{\\fbimajor\\f3\fbidi \froman\\fcharset0\fprq2{\\*\panose 02020603050405020304}Times New Roman;}{\\flominor\\f3\fbidi \froman\\fcharset0\fprq2{\\*\panose 02020603050405020304}Times New Roman;}
{\\fdbminor\\f3\fbidi \froman\\fcharset0\fprq2{\\*\panose 02020603050405020304}Times New Roman;}{\\fhiminor\\f3\fbidi \fswiss\\fcharset0\fprq2{\\*\panose 020f0502020204030204}Calibri;}
{\\fbiminor\\f3\fbidi \froman\\fcharset0\fprq2{\\*\panose 02020603050405020304}Times New Roman;}{\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman CE;}{\\f4\fbidi \froman\\fcharset2\fprq2Times New Roman Cyr;}
{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman Greek;}{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman Tur;}{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman (Hebrew);}{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman (Arabic);}
{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman Baltic;}{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman (Vietnamese);}{\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman CE;}{\\f4\fbidi \froman\\fcharset2\fprq2Times New Roman Cyr;}
{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman Greek;}{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman Tur;}{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman (Hebrew);}{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman (Arabic);}
{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman Baltic;}{\\f4\fbidi \froman\\fcharset1\fprq2Times New Roman (Vietnamese);}{\\f4\fbidi \fswiss\\fcharset2\fprq2Calibri CE;}{\\f4\fbidi \fswiss\\fcharset2\fprq2Calibri Cyr;}
{\\f4\fbidi \fswiss\\fcharset1\fprq2Calibri Greek;}{\\f4\fbidi \fswiss\\fcharset1\fprq2Calibri Tur;}{\\f4\fbidi \fswiss\\fcharset1\fprq2Calibri Baltic;}{\\f4\fbidi \fswiss\\fcharset1\fprq2Calibri (Vietnamese);}
{\\flomajor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman CE;}{\\flomajor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman Cyr;}{\\flomajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Greek;}
{\\flomajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Tur;}{\\flomajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Hebrew);}{\\flomajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Arabic);}
{\\flomajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Baltic;}{\\flomajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Vietnamese);}{\\fdbmajor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman CE;}
{\\fdbmajor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman Cyr;}{\\fdbmajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Greek;}{\\fdbmajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Tur;}
{\\fdbmajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Hebrew);}{\\fdbmajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Arabic);}{\\fdbmajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Baltic;}
{\\fdbmajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Vietnamese);}{\\fhimajor\\f3\fbidi \fswiss\\fcharset2\fprq2Calibri Light CE;}{\\fhimajor\\f3\fbidi \fswiss\\fcharset2\fprq2Calibri Light Cyr;}
{\\fhimajor\\f3\fbidi \fswiss\\fcharset1\fprq2Calibri Light Greek;}{\\fhimajor\\f3\fbidi \fswiss\\fcharset1\fprq2Calibri Light Tur;}{\\fhimajor\\f3\fbidi \fswiss\\fcharset1\fprq2Calibri Light Baltic;}
{\\fhimajor\\f3\fbidi \fswiss\\fcharset1\fprq2Calibri Light (Vietnamese);}{\\fbimajor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman CE;}{\\fbimajor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman Cyr;}
{\\fbimajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Greek;}{\\fbimajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Tur;}{\\fbimajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Hebrew);}
{\\fbimajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Arabic);}{\\fbimajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Baltic;}{\\fbimajor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Vietnamese);}
{\\flominor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman CE;}{\\flominor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman Cyr;}{\\flominor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Greek;}
{\\flominor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Tur;}{\\flominor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Hebrew);}{\\flominor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Arabic);}
{\\flominor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Baltic;}{\\flominor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Vietnamese);}{\\fdbminor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman CE;}
{\\fdbminor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman Cyr;}{\\fdbminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Greek;}{\\fdbminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Tur;}
{\\fdbminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Hebrew);}{\\fdbminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Arabic);}{\\fdbminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Baltic;}
{\\fdbminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Vietnamese);}{\\fhiminor\\f3\fbidi \fswiss\\fcharset2\fprq2Calibri CE;}{\\fhiminor\\f3\fbidi \fswiss\\fcharset2\fprq2Calibri Cyr;}
{\\fhiminor\\f3\fbidi \fswiss\\fcharset1\fprq2Calibri Greek;}{\\fhiminor\\f3\fbidi \fswiss\\fcharset1\fprq2Calibri Tur;}{\\fhiminor\\f3\fbidi \fswiss\\fcharset1\fprq2Calibri Baltic;}
{\\fhiminor\\f3\fbidi \fswiss\\fcharset1\fprq2Calibri (Vietnamese);}{\\fbiminor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman CE;}{\\fbiminor\\f3\fbidi \froman\\fcharset2\fprq2Times New Roman Cyr;}
{\\fbiminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Greek;}{\\fbiminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Tur;}{\\fbiminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Hebrew);}
{\\fbiminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Arabic);}{\\fbiminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman Baltic;}{\\fbiminor\\f3\fbidi \froman\\fcharset1\fprq2Times New Roman (Vietnamese);}}
{\\colortbl;;\red0\green0\blue0;\red0\green0\blue2;\red0\green2\blue2;\red0\green2\blue0;\red2\green0\blue2;\red2\green0\blue0;\red2\green2\blue0;\red2\green2\blue2;\red0\green0\blue1;\red0\green1\blue1;\red0\green1\blue0;
\red1\green0\blue1;\red1\green0\blue0;\red1\green1\blue0;\red1\green1\blue1;\red1\green1\blue1;}{\\*\defchp \f3\fs2}{\\*\defpap \ql \li0\ri0\sa1\sl2\slmult1
\widctlpar\\wrapdefault\\aspalpha\\aspnum\\faauto\\adjustright\\rin0\lin0\itap0}\noqfpromote {\\stylesheet{{\\ql \li0\ri0\sa1\sl2\slmult1\widctlpar\\wrapdefault\\aspalpha\\aspnum\\faauto\\adjustright\\rin0\lin0\itap0\rtlch\\fcs1\af3\afs2\alang1
\ltrch\\fcs0\f3\fs2\lang1\langfe1\cgrid\\langnp1\langfenp1\snext0\sqformat \spriority0Normal;}{\\*\cs1\additive \ssemihidden \sunhideused \spriority1Default Paragraph Font;}{\\*
\ts1\tsrowd\\trftsWidthB3\trpaddl1\trpaddr1\trpaddfl3\trpaddft3\trpaddfb3\trpaddfr3\tblind0\tblindtype3\tsvertalt\\tsbrdrt\\tsbrdrl\\tsbrdrb\\tsbrdrr\\tsbrdrdgl\\tsbrdrdgr\\tsbrdrh\\tsbrdrv \ql \li0\ri0\sa1\sl2\slmult1
\widctlpar\\wrapdefault\\aspalpha\\aspnum\\faauto\\adjustright\\rin0\lin0\itap0\rtlch\\fcs1\af3\afs2\alang1\ltrch\\fcs0\f3\fs2\lang1\langfe1\cgrid\\langnp1\langfenp1\snext1\ssemihidden \sunhideused Normal Table;}{

\s1\ql \li7\ri0\sa1\sl2\slmult1\widctlpar\\wrapdefault\\aspalpha\\aspnum\\faauto\\adjustright\\rin0\lin7\itap0\contextualspace \rtlch\\fcs1\af3\afs2\alang1\ltrch\\fcs0\f3\fs2\lang1\langfe1\cgrid\\langnp1\langfenp1
\sbasedon0\snext1\sqformat \spriority3\styrsid1List Paragraph;}{\\*\ts1\tsrowd\\trbrdrt\\brdrs\\brdrw1\trbrdrl\\brdrs\\brdrw1\trbrdrb\\brdrs\\brdrw1\trbrdrr\\brdrs\\brdrw1\trbrdrh\\brdrs\\brdrw1\trbrdrv\\brdrs\\brdrw1
\trpaddl1\trpaddr1\trpaddfl3\trpaddft3\trpaddfb3\trpaddfr3\tblind0\tblindtype0\tsvertalt\\tsbrdrt\\tsbrdrl\\tsbrdrb\\tsbrdrr\\tsbrdrdgl\\tsbrdrdgr\\tsbrdrh\\tsbrdrv \ql \li0\ri0\widctlpar\\wrapdefault\\aspalpha\\aspnum\\faauto\\adjustright\\rin0\lin0\itap0
\rtlch\\fcs1\af0\afs2\alang1\ltrch\\fcs0\f3\fs2\lang1\langfe1\cgrid\\langnp1\langfenp1\sbasedon1\snext1\spriority3\styrsid1Table Grid;}}{\\*\listtable{{\\list\\listtemplateid6\listhybrid{{\\listlevel\\levelnfc0
\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\levelspace3\levelindent0{\\leveltext\\leveltemplateid6\'0\'0.;}{\\levelnumbers\\'0;}\rtlch\\fcs1\af0\ltrch\\fcs0\fi-3\li7\lin7}{\\listlevel\\levelnfc4\levelnfcn4\leveljc0\leveljcn0
\levelfollow0\levelstartat1\lvltentative\\levelspace3\levelindent0{\\leveltext\\leveltemplateid6\'0\'0.;}{\\levelnumbers\\'0;}\rtlch\\fcs1\af0\ltrch\\fcs0\fi-3\li1\lin1}{\\listlevel\\levelnfc2\levelnfcn2\leveljc2\leveljcn2\levelfollow0
\levelstartat1\lvltentative\\levelspace3\levelindent0{\\leveltext\\leveltemplateid6\'0\'0.;}{\\levelnumbers\\'0;}\rtlch\\fcs1\af0\ltrch\\fcs0\fi-1\li2\lin2}{\\listlevel\\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1
\lvltentative\\levelspace3\levelindent0{\\leveltext\\leveltemplateid6\'0\'0.;}{\\levelnumbers\\'0;}\rtlch\\fcs1\af0\ltrch\\fcs0\fi-3\li2\lin2}{\\listlevel\\levelnfc4\levelnfcn4\leveljc0\leveljcn0\levelfollow0\levelstartat1\lvltentative

\levelspace3\levelindent0{\\leveltext\\leveltemplateid6\'0\'0.;}{\\levelnumbers\\'0;}\rtlch\\fcs1\af0\ltrch\\fcs0\fi-3\li3\lin3}{\\listlevel\\levelnfc2\levelnfcn2\leveljc2\leveljcn2\levelfollow0\levelstartat1\lvltentative\\levelspace3
\levelindent0{\\leveltext\\leveltemplateid6\'0\'0.;}{\\levelnumbers\\'0;}\rtlch\\fcs1\af0\ltrch\\fcs0\fi-1\li4\lin4}{\\listlevel\\levelnfc0\levelnfcn0\leveljc0\leveljcn0\levelfollow0\levelstartat1\lvltentative\\levelspace3\levelindent0
{\\leveltext\\leveltemplateid6\'0\'0.;}{\\levelnumbers\\'0;}\rtlch\\fcs1\af0\ltrch\\fcs0\fi-3\li5\lin5}{\\listlevel\\levelnfc4\levelnfcn4\leveljc0\leveljcn0\levelfollow0\levelstartat1\lvltentative\\levelspace3\levelindent0{\\leveltext

\leveltemplateid6\'0\'0.;}{\\levelnumbers\\'0;}\rtlch\\fcs1\af0\ltrch\\fcs0\fi-3\li5\lin5}{\\listlevel\\levelnfc2\levelnfcn2\leveljc2\leveljcn2\levelfollow0\levelstartat1\lvltentative\\levelspace3\levelindent0{\\leveltext

\leveltemplateid6\'0\'0.;}{\\levelnumbers\\'0;}\rtlch\\fcs1\af0\ltrch\\fcs0\fi-1\li6\lin6}{\\listname ;}\listid9}}{\\*\listoverridetable{

This is what I have so far:

public String getContent(){           //my method
    String str = null;
    Pattern pattern = Pattern.compile("({\\\\\*\\listtable(.*\W)*){\\\\\*\\listoverridetable");
    Matcher matcher = pattern.matcher(bodyContent);

    if (matcher.find()) {
         str = matcher.group(2);
    }
    return str;
}

But when I try to implement the regex from this link, it always gives me errors such as "invalid escape sequence". How should I do this in Java?

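3 answers:

Answer 0 (score: 2)

You need to be very careful with escaping in Java regex. To avoid at least some of the escaping, just use character classes [...].

Here is a pattern string that matches the text literally: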

String pattern = "(?s)[{]\\\\\\\\[*]\\\\listtable.*?(?=[{]\\\\\\\\[*]\\\\listoverridetable[{])";

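To match one literal \, you need 4 \ symbols in the Java string literal: the compiler turns \\\\ into \\, which the regex engine reads as a single literal backslash. The pattern above uses 8 before [*] because the sample text has two backslashes there.

Note the (?s) single-line (DOTALL) modifier, which makes the dot match newline characters. In your regex, (.*\\W)* is very inefficient at capturing a substring that contains newlines, because nested quantifiers can cause heavy backtracking.

See the IDEONE demo.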
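As a minimal sketch, the pattern above could be wired into a method like the question's getContent as follows. The class name ListTableExtractor, the method extractListTable and the shortened input are hypothetical, and the sketch assumes the text really contains two backslashes before * and one before listtable, as in the sample.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ListTableExtractor {
    // (?s) lets '.' cross line breaks; the lazy .*? stops at the first position
    // where the lookahead sees the {\\*\listoverridetable{ marker, so the marker
    // itself is not part of the match.
    private static final Pattern LIST_TABLE = Pattern.compile(
        "(?s)[{]\\\\\\\\[*]\\\\listtable.*?(?=[{]\\\\\\\\[*]\\\\listoverridetable[{])");

    /** Returns the list-table section, or null when the markers are not found. */
    static String extractListTable(String bodyContent) {
        Matcher matcher = LIST_TABLE.matcher(bodyContent);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical shortened input; every "\\" in a Java literal is one backslash in the text.
        String rtf = "{\\\\stylesheet;}{\\\\*\\listtable{\\\\list\n\\listid9}}{\\\\*\\listoverridetable{";
        System.out.println(extractListTable(rtf));
    }
}
```

Compiling the pattern once as a constant avoids recompiling it on every call. matcher.group() returns the whole match, so no capturing group is needed.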

Answer 1 (score: 0)

Try this:

Pattern pattern = Pattern.compile("(\\{\\\\\\\\\\*\\\\listtable(.*\\W)*)\\{\\\\\\\\\\*\\\\listoverridetable");

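You have to escape the \ in several places.

Answer 2 (score: 0)

Looking at your link, you are not escaping any of the \ characters for Java.

You need to do something like this: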

Pattern pattern = Pattern.compile("(\\{\\\\\\\\\\*\\\\listtable(.*\\W)*)\\{\\\\\\\\\\*\\\\listoverridetable");

Basically, for every \ in the regex, you need to write \\ in the Java string literal.
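The two layers of escaping can be checked directly. The sketch below is only an illustration: the class BackslashLayers and the helper containsLiteral are made-up names. It also uses Pattern.quote, a standard java.util.regex method that none of the answers mention, as one possible way to match a fixed marker without hand-escaping it for the regex layer.

```java
import java.util.regex.Pattern;

class BackslashLayers {
    /** True when text contains marker, with marker treated as plain text rather than as a regex. */
    static boolean containsLiteral(String text, String marker) {
        // Pattern.quote wraps the marker so that {, * and \ lose their regex meaning.
        return Pattern.compile(Pattern.quote(marker)).matcher(text).find();
    }

    public static void main(String[] args) {
        String oneBackslash = "\\";    // Java literal "\\" is 1 character: \
        String regexForOne = "\\\\";   // Java literal "\\\\" is 2 characters, the regex for one \
        System.out.println(oneBackslash.length());             // 1
        System.out.println(regexForOne.length());              // 2
        System.out.println(oneBackslash.matches(regexForOne)); // true

        // Only the Java string layer still needs escaping here.
        String marker = "{\\\\*\\listtable";
        System.out.println(containsLiteral("abc{\\\\*\\listtable{", marker)); // true
    }
}
```

With Pattern.quote the marker is written once with Java-level escaping only, which halves the number of backslashes compared with a hand-escaped regex.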