I'm working on a document parser to extract data from some documents I've been given, and I'm coding it in C#. The file format is as follows:
(Type 1): (potentially multi-lined string)
(Type 2): (potentially multi-lined string)
(Type 3): (potentially multi-lined string)
...
(Type N): (potentially multi-lined string)
(Type 1): (potentially multi-lined string)
...
End Of Document.
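The document repeats the same (Type 1) - (Type N) format M times.
I'm having trouble with the multi-line strings and with matching the last iteration of (Type 1) - (Type N).
What I need to do is capture each (potentially multi-lined string) in a group named after the (Type #) that precedes it.
Here is a snippet of the document I'm trying to match: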
Name: John Dow
Position: VP. over Development
Bio: Here is a really long string of un important stuff that could include words like "Bio" or "Name". Some times I have problems here, but for the most part it should be normal Bio information
Position History: Vp. over Development
Sr. Project Manager
Jr. Project Manager
Developer
Peon
Notes: Here are some notes that may or may not be multilined and if it is, all the lines need to be captured for this person.
Name: Joe Noob
Position: Peon
Bio: I'm a peon, so I have little bio
Position History: Peon
Notes: few notes
Name: Jane Smith
Position: VP. over Sales
Bio: Here is a really long string of more un important stuff that could include words like "Bio" or "Name". Some times I have problems here, but for the most part it should be normal Bio information
Position History: Vp. over Sales
Sales Manager
Secretary
Notes: Here are some notes that may or may not be multilined and if it is, all the lines need to be captured for this person.
The (Type #) labels always appear in the same order, and they always start on a new line.
Here's what I have:
Name:\s(?:(?.*?)\r\n)+?Position:\s(?:(?.*?)\r\n)+?Bio:\s(?:(?.*?)\r\n)+?Position History:\s(?:(?.*?)\r\n)+?Notes:\s(?:(?.*?)\r\n)+?
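Any help would be great!
Answer 0 (score: 3)
Because you're using lazy matching, the last token takes only as much as it needs. You can fix this by adding a lookahead at the end of the pattern so that it matches up to the next label: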
(?=^Name:|$)
Here is the full regex:
Name:\s(?:(.*?)\s+)Position:\s(?:(.*?)\s+)Bio:\s(?:(.*?)\s+)Position History:\s(?:(.*?)\s+)Notes:\s(?:(.*?)\s+)(?=^Name:|$)
Example: http://regexhero.net/tester/?id=92982feb-806f-4d0a-96a3-5ef6689a0e01
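For anyone who wants to try the lookahead idea from C#, here is a rough sketch rather than a drop-in solution. Several things in it are assumptions of mine and not part of the answer: the group names (`Name`, `Position`, `Bio`, `PositionHistory`, `Notes`), the sample text, `\r\n` line endings, and the `RegexOptions.Singleline | RegexOptions.Multiline` combination. It also ends the lookahead with `\z` instead of `$`, because under `Multiline` a `$` matches at the end of every line, and it treats the `End Of Document.` line from the question as a terminator.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative sample only: two records, \r\n line endings, multi-line values.
string doc =
    "Name: John Dow\r\n" +
    "Position: VP. over Development\r\n" +
    "Bio: Long text mentioning \"Bio\" or \"Name\"\r\nspread over two lines\r\n" +
    "Position History: Vp. over Development\r\nDeveloper\r\nPeon\r\n" +
    "Notes: first line of notes\r\nsecond line of notes\r\n" +
    "Name: Joe Noob\r\n" +
    "Position: Peon\r\n" +
    "Bio: I'm a peon, so I have little bio\r\n" +
    "Position History: Peon\r\n" +
    "Notes: few notes\r\n" +
    "End Of Document.";

// Group names are hypothetical. Singleline lets '.' cross line breaks;
// Multiline lets '^' match at the start of each line.
var pattern = new Regex(
    @"^Name:\s(?<Name>.*?)\s+" +
    @"^Position:\s(?<Position>.*?)\s+" +
    @"^Bio:\s(?<Bio>.*?)\s+" +
    @"^Position History:\s(?<PositionHistory>.*?)\s+" +
    @"^Notes:\s(?<Notes>.*?)\s+" +
    // Stop the lazy Notes group at the next record, the end marker, or end of input.
    @"(?=^Name:|^End Of Document\.|\z)",
    RegexOptions.Singleline | RegexOptions.Multiline);

MatchCollection matches = pattern.Matches(doc);
foreach (Match m in matches)
{
    Console.WriteLine("Name:  " + m.Groups["Name"].Value);
    Console.WriteLine("Notes: " + m.Groups["Notes"].Value);
}
```

Each field is a single lazy group, so `m.Groups["Notes"].Value` holds the whole multi-line value and there is no need to stitch captures together.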
Answer 1 (score: 2)
Try this:
(?'tag'[\w\s]+):\s*(?'val'.*([\r\n][^:]*)*)
I just take the label in front of the ':' as the named group 'tag', and the (potentially) multi-line text as the named group 'val'.
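A possible way to use the tag/value idea from C# is sketched below. It is a variant, not the exact pattern above: continuation lines are only absorbed into `val` when they do not themselves look like `label:`, which is a negative lookahead I added so that the next label stays out of the previous value. The sample text and the grouping of pairs into records (a new record starts at each `Name`) are also assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative sample only; values may span several lines.
string text =
    "Name: John Dow\r\n" +
    "Position: VP. over Development\r\n" +
    "Bio: first bio line\r\nsecond bio line\r\n" +
    "Notes: some notes\r\n" +
    "Name: Joe Noob\r\n" +
    "Position: Peon\r\n" +
    "Bio: short bio\r\n" +
    "Notes: few notes\r\n";

// tag = label at the start of a line; val = rest of that line plus any
// following non-empty lines that do not start with "label:".
var tagValue = new Regex(
    @"^(?<tag>[\w ]+):[ \t]*(?<val>[^\r\n]*(?:\r?\n(?![\w ]+:)[^\r\n]+)*)",
    RegexOptions.Multiline);

var records = new List<Dictionary<string, string>>();
foreach (Match m in tagValue.Matches(text))
{
    string tag = m.Groups["tag"].Value;
    // Assumption: every record begins with its Name field.
    if (tag == "Name" || records.Count == 0)
    {
        records.Add(new Dictionary<string, string>());
    }
    records[records.Count - 1][tag] = m.Groups["val"].Value;
}

foreach (var record in records)
{
    Console.WriteLine(record["Name"] + " / " + record["Notes"]);
}
```

The obvious limitation is that a continuation line which happens to start with words followed by a colon would be read as a new label, so this only fits data where that cannot occur.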
Answer 2 (score: 2)
The simplest fix is to do the match in right-to-left mode:
Regex r = new Regex(@"Name:\s(?:(.*?)\r\n)+?" +
                    @"Position:\s(?:(.*?)\r\n)+?" +
                    @"Bio:\s(?:(.*?)\r\n)+?" +
                    @"Position History:\s(?:(.*?)\r\n)+?" +
                    @"Notes:\s(?:(.*?)\r\n)+?",
                    RegexOptions.Singleline | RegexOptions.RightToLeft);
By the way, I had to remove a bunch of misplaced question marks to get it to work at all. You do want those groups to capture, don't you?
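To see the right-to-left idea end to end, here is a small sketch. It is a simplified variant of the pattern above, not the same regex: each field is one lazy named group followed by `\r\n` (no per-line `+?` loop), so the whole multi-line value lands in `Groups[...].Value`. The group names and the sample text are assumptions, and the input is assumed to end with `\r\n` right after the last Notes value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustrative sample only; it ends with \r\n right after the last Notes value.
string doc =
    "Name: John Dow\r\n" +
    "Position: VP. over Development\r\n" +
    "Bio: first bio line\r\nsecond bio line\r\n" +
    "Position History: Vp. over Development\r\nDeveloper\r\n" +
    "Notes: first note line\r\nsecond note line\r\n" +
    "Name: Joe Noob\r\n" +
    "Position: Peon\r\n" +
    "Bio: short bio\r\n" +
    "Position History: Peon\r\n" +
    "Notes: few notes\r\n";

// Group names are hypothetical. With RightToLeft the engine works backwards
// from the end, so each lazy group grows leftwards until its label is found.
var rightToLeft = new Regex(
    @"Name:\s(?<Name>.*?)\r\n" +
    @"Position:\s(?<Position>.*?)\r\n" +
    @"Bio:\s(?<Bio>.*?)\r\n" +
    @"Position History:\s(?<PositionHistory>.*?)\r\n" +
    @"Notes:\s(?<Notes>.*?)\r\n",
    RegexOptions.Singleline | RegexOptions.RightToLeft);

// Sort by position so the records come out in document order
// regardless of the order in which the matches were found.
var records = new List<Match>();
foreach (Match m in rightToLeft.Matches(doc))
{
    records.Add(m);
}
records.Sort((a, b) => a.Index.CompareTo(b.Index));

foreach (Match m in records)
{
    Console.WriteLine(m.Groups["Name"].Value + " / " + m.Groups["Notes"].Value);
}
```

The sample leaves out the `End Of Document.` line. With that line present I would expect it to end up inside the last Notes value, so it would probably need to be stripped before matching.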