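My RTF parser needs to handle two kinds of RTF file (one file per program run): RTF files saved from Word and RTF files created by a COTS report generator utility. Both kinds are valid RTF, but they differ. My parser uses regex patterns to detect, extract, and process various RTF elements in both kinds of file.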
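I decided to implement the list of RTF regex patterns as two dictionaries: one for the patterns needed for Word RTF files and one for the patterns needed for the COTS utility's RTF files. At run time, my parser detects which kind of RTF file is being processed (Word RTF contains the RTF element //schemas.microsoft.com/office/word, whereas COTS RTF does not) and then fetches the required regex pattern from the appropriate dictionary.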
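The detection step described above can be sketched as a plain substring check, assuming the presence of the marker alone is enough to tell the two kinds apart. `RtfKind` and `IsWordRtf` are hypothetical names used only for illustration.

```csharp
using System;

public static class RtfKind
{
    // Marker that, per the description above, appears in Word RTF only.
    private const string WordMarker = "//schemas.microsoft.com/office/word";

    // Returns true if the marker occurs anywhere in the text, ignoring case.
    public static bool IsWordRtf(string rawRtfText)
    {
        return rawRtfText.IndexOf(WordMarker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

An ordinal search treats the dots in the marker literally, and the result does not depend on how many times the marker occurs in the file.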
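To make it easier to reference patterns while writing code, I implemented an enum in which each value represents a specific regex pattern. To make it easier to keep the patterns in sync with their enum values, I implemented the regex patterns as a here-string in which each line is a CSV record: {enum name}, {word rtf regex pattern}, {cots rtf regex pattern}. Then, at run time, when loading the patterns into their dictionaries, I look up the enum's int value from the name in the CSV and use it as the dictionary key.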
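This makes writing the code easier, but I am not sure it is the best way to implement and reference the RTF expressions. Is there a better way?
Sample code: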
public enum Rex {FOO, BAR};
string ex = @"FOO, word rtf regex pattern for FOO, cots rtf regex pattern for FOO
BAR, word rtf regex pattern for BAR, cots rtf regex pattern for BAR
";
I load the dictionaries like this:
using (StringReader reader = new StringReader(ex))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] splitLine = line.Split(',');
int enumIntValue = (int)(Rex)Enum.Parse(typeof(Rex), splitLine[0].Trim());
ObjWordRtfDict.Add(enumIntValue, splitLine[1].Trim());
ObjRtfDict.Add(enumIntValue, splitLine[2].Trim());
}
}
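One thing to watch with this loader: `Split(',')` also splits on any comma inside a pattern, such as the quantifier in `\d{1,3}`. Below is a minimal sketch of one way around that, assuming the columns are separated by a tab rather than a comma. The `PatternLoader` class, its `Load` method, and the tab delimiter are hypothetical and not part of the original code. The dictionaries are keyed by the enum itself here, which avoids the cast to int.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public enum Rex { FOO, BAR }

public static class PatternLoader
{
    // Hypothetical helper: parses "name<TAB>word pattern<TAB>cots pattern" lines.
    public static void Load(string text,
                            IDictionary<Rex, string> wordDict,
                            IDictionary<Rex, string> cotsDict)
    {
        using (var reader = new StringReader(text))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (string.IsNullOrWhiteSpace(line)) continue; // skip blank lines

                // At most 3 fields; any extra tab stays inside the last field.
                string[] fields = line.Split(new[] { '\t' }, 3);
                if (fields.Length != 3)
                    throw new FormatException("Expected 3 tab-separated fields: " + line);

                // TryParse lets us report an unknown name with our own message.
                if (!Enum.TryParse(fields[0].Trim(), out Rex key))
                    throw new FormatException("Unknown pattern name: " + fields[0]);

                wordDict.Add(key, fields[1].Trim());
                cotsDict.Add(key, fields[2].Trim());
            }
        }
    }
}
```

With this layout a line such as `FOO<TAB>\d{1,3}<TAB>abc` keeps its comma intact. Whether a tab is a safe delimiter depends on the patterns in use, so treat the choice as an assumption.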
Then, at run time, I access either ObjWordRtfDict or ObjRtfDict, depending on the kind of RTF file the parser detected.
string regExPattFoo = ObjRegExExpr.GetRegExPattern(ClsRegExExpr.Rex.FOO);
public string GetRegExPattern(Rex patternIndex)
{
string regExPattern = "";
if (isWordRtf)
{
ObjWordRtfDict.TryGetValue((int)patternIndex, out regExPattern);
}
else
{
ObjRtfDict.TryGetValue((int)patternIndex, out regExPattern);
}
return regExPattern;
}
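New code, revised following Asif's suggestion
I kept the enum of pattern names so that the compiler can check references to pattern names.
Sample CSV file, included as an embedded resource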
SECT,^\\pard.*\{\\rtlch.*\\sect\s\}, ^\\pard.*\\sect\s\}
HORZ_LINE2, \{\\pict.*\\pngblip, TBD
Usage example
string sectPattern = ObjRegExExpr.GetRegExPattern(ClsRegExPatterns.Names.SECT);
The ClsRegExPatterns class
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Text.RegularExpressions;
namespace foo
{
public class ClsRegExPatterns
{
readonly bool isWordRtf = false;
List<ClsPattern> objPatternList;
public enum Names { SECT, HORZ_LINE2 };
public class ClsPattern
{
public string Name { get; set; }
public string WordRtfRegex { get; set; }
public string COTSRtfRegex { get; set; }
}
public ClsRegExPatterns(StringBuilder rawRtfTextFromFile)
{
// determine if input file is Word rtf or not Word rtf
if ((Regex.Matches(rawRtfTextFromFile.ToString(), "//schemas.microsoft.com/office/word", RegexOptions.IgnoreCase)).Count == 1)
{
isWordRtf = true;
}
//read patterns from embedded content csv file
string patternsAsCsv = new StreamReader((Assembly.GetExecutingAssembly()).GetManifestResourceStream("eLabBannerLineTool.Packages.patterns.csv")).ReadToEnd();
//create list to hold patterns
objPatternList = new List<ClsPattern>();
//load pattern list
using (StringReader reader = new StringReader(patternsAsCsv))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] splitLine = line.Split(',');
ClsPattern objPattern = new ClsPattern
{
Name = splitLine[0].Trim(),
WordRtfRegex = splitLine[1].Trim(),
COTSRtfRegex = splitLine[2].Trim()
};
objPatternList.Add(objPattern);
}
}
}
public string GetRegExPattern(Names patternIndex)
{
string regExPattern = "";
string patternName = patternIndex.ToString();
if (isWordRtf)
{
regExPattern = objPatternList.SingleOrDefault(x => x.Name == patternName)?.WordRtfRegex;
}
else
{
regExPattern = objPatternList.SingleOrDefault(x => x.Name == patternName)?.COTSRtfRegex;
}
return regExPattern;
}
}
}
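`GetRegExPattern` above scans the list on every call, and a name that is missing from the CSV only surfaces later as a null pattern. Here is a sketch of one possible refinement, assuming it is acceptable to fail at load time: index the rows by enum value once and check that every `Names` member has a row. `PatternTable` and its members are hypothetical names, and the CSV text is passed in as a string to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public enum Names { SECT, HORZ_LINE2 }

public sealed class PatternTable
{
    // rows[name][0] is the Word pattern, rows[name][1] is the COTS pattern.
    private readonly Dictionary<Names, string[]> rows = new Dictionary<Names, string[]>();
    private readonly bool isWordRtf;

    public PatternTable(string csvText, bool isWordRtf)
    {
        this.isWordRtf = isWordRtf;
        using (var reader = new StringReader(csvText))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (string.IsNullOrWhiteSpace(line)) continue; // skip blank lines
                // Same comma split as the original; patterns must not contain commas.
                string[] f = line.Split(',');
                if (f.Length != 3)
                    throw new FormatException("Expected 3 fields: " + line);
                // Enum.Parse throws ArgumentException for a name not in the enum.
                var name = (Names)Enum.Parse(typeof(Names), f[0].Trim());
                rows[name] = new[] { f[1].Trim(), f[2].Trim() };
            }
        }
        // Fail fast if the CSV and the enum have drifted apart.
        foreach (Names n in Enum.GetValues(typeof(Names)))
        {
            if (!rows.ContainsKey(n))
                throw new InvalidOperationException("No pattern row for " + n);
        }
    }

    public string GetRegExPattern(Names name)
    {
        string[] row = rows[name];
        return isWordRtf ? row[0] : row[1];
    }
}
```

The enum still gives compile-time checking of pattern names, and the constructor check covers the other direction, that is, an enum member with no CSV row.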
Answer 0 (score: 1)
If I understand your problem statement correctly, I would prefer something like the following.
Create a class named RtfProcessor
public class RtfProcessor
{
public string Name { get; set; }
public string WordRtfRegex { get; set; }
public string COTSRtfRegex { get; set; }
void ProcessFile()
{
throw new NotImplementedException();
}
}
where Name holds FOO, BAR, and so on. You can maintain a list of these objects and populate it from the CSV file as shown below
List<RtfProcessor> fileProcessors = new List<RtfProcessor>();
using (StringReader reader = new StringReader(ex))
{
string line;
while ((line = reader.ReadLine()) != null)
{
string[] splitLine = line.Split(',');
RtfProcessor rtfProcessor = new RtfProcessor();
rtfProcessor.Name = splitLine[0].Trim();
rtfProcessor.WordRtfRegex = splitLine[1].Trim();
rtfProcessor.COTSRtfRegex = splitLine[2].Trim();
fileProcessors.Add(rtfProcessor);
}
}
and retrieve the regex pattern for FOO or BAR
// to get the Word regex pattern for FOO you can use (requires using System.Linq)
string fooPattern = fileProcessors.SingleOrDefault(x => x.Name == "FOO")?.WordRtfRegex;
Hope this helps.