Question

I have large text files that look like this:

   001 00 *f a *a 0 014 017 1 *d 19740918 *c 19890127 *b 718500
    004 00 *a e *r c
    008 00 *t m *v 1 *u f *a 1974 *b dk *l dan
    009 00 *a a *g xx
    021 00 *a 87-7492-095-2 *d kr 69.00
    032 00 *a DBR197075
    041 00 *a dan *d eng
    100 00 *a Kêœ³rsted *h Tage
    245 00 *a Storbritannien og Danmark 1914-1920
    260 00 *b Odense Universitetsforlag *c 1974
    300 00 *a 240 sider *b ill
    440 00 *a Odense University studies in history and social sciences *v 17
    534 00 *a Med engelsk resumÃ©
    652 00 *m 93.6
    652 00 *Ã¥ 1 *p 96.7
    710 00 *a Odense Universitet *x se *w Odense University studies in history and social sciences
    e01 00 *a illegal subfield x present in 710
    e01 00 *a illegal subfield w present in 710

    001 00 *f a *a 0 014 573 4 *d 19741030 *c 19871230 *b 718500
    004 00 *a e *r c
    008 00 *t m *v 1 *u u *a 1974 *b dk *l dan
    009 00 *a a *g xx
    021 00 *a 87-422-7944-5 *d kr 49.00
    032 00 *a DBR197075 *x NDF190104
    245 00 *a De Â¤politiske partier *Ã¸ Ved Poul MÃ¸ller
    250 00 *a Ny udgave *c redigeret af \Poul MÃ¸ller\ *d udgivet af Socialpolitisk Forening
    260 00 *b Det Danske Forlag *c 1974
    440 00 *a Nyt socialt bibliotek
    652 00 *m 32.26
    700 00 *a MÃ¸ller *h Poul *c f. 1919
    710 00 *a Socialpolitisk Forening *x se *w Nyt socialt bibliotek
    e01 00 *a illegal subfield x present in 710
    e01 00 *a illegal subfield w present in 710

    001 00 *f a *a 0 014 691 9 *d 19741030 *c 19871018 *b 718500
    004 00 *a e *r c
    008 00 *t m *v 1 *u f *a 1973 *b dk *l dan
    009 00 *a a *g xx
    021 00 *a 87-7472-020-1 *d kr 27.60
    032 00 *a DBR197075
    110 00 *a Landsarkivet for SjÃ¦lland m. m
    245 00 *a Oversigt over Landsarkivets samling af kort og tegninger indtil ca. 1900
    260 00 *c 1973
    300 00 *a 6, 207 sider
    440 00 *a ForelÃ¸bige arkivregistraturer / udgivet af Landsarkivet for SjÃ¦lland m. m.
    652 00 *m 02.91
    710 00 *a Landsarkivet for SjÃ¦lland m. m. *x se *w ForelÃ¸bige arkivregistraturer / udgivet af Landsarkivet for SjÃ¦lland m. m.
    e01 00 *a illegal subfield x present in 710
    e01 00 *a illegal subfield w present in 710

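I need to capture each group so that I can handle the errors individually.

Each group starts with "001" and ends at the next empty line (so there are three groups in the example above).

I ran some tests on http://regexstorm.net/ with a naive pattern like 001((.|\n)*)(?=001), but never got it to work. I have never used regex before, and I must admit it looks somewhat hard to use, but from what I hear it is the way to go when breaking apart semi-complex text files.

The text encoding is obviously broken, so some of the text is garbled, and God knows what the line breaks and the like really are (Notepad++ says CRLF).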

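The regex should do something like this:

Match every group of text that starts with "001" and ends before the next "001" or at the first empty line.

Most of the text files contain upwards of 50,000 such groups.

I am using C#, so any suggestions in that language would be great, but I consider that a luxury; as I understand it, most regular expressions can be used in any language that supports them with little to no modification.

Any help would be greatly appreciated :)

TIA.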

Answer 1

You could try

.Split(new string[] { "\r\n001" }, StringSplitOptions.RemoveEmptyEntries)

This may be easier than a regex.
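One thing to watch for with this approach is that Split consumes the delimiter, so the pieces lose their leading 001. The following is only a minimal sketch of putting it back. It assumes the whole file fits in memory, uses CRLF line endings, and starts with 001; the SplitGroups helper and the sample text are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

static class SplitOn001
{
    // Hypothetical helper: splits on "\r\n001" and puts back the "001"
    // that Split consumes as part of the delimiter.
    // Assumes CRLF line endings and that the text itself starts with "001".
    public static List<string> SplitGroups(string text)
    {
        // Prepending "\r\n" puts the delimiter in front of the first group too,
        // so every piece can be treated the same way.
        string[] pieces = ("\r\n" + text).Split(
            new string[] { "\r\n001" }, StringSplitOptions.RemoveEmptyEntries);

        var groups = new List<string>();
        foreach (string piece in pieces)
        {
            // TrimEnd drops the blank line left at the end of each group.
            groups.Add("001" + piece.TrimEnd());
        }
        return groups;
    }

    static void Main()
    {
        string text = "001 00 *a first\r\n004 00 *a e\r\n\r\n"
                    + "001 00 *a second\r\ne01 00 *a oops\r\n";
        foreach (string g in SplitGroups(text))
        {
            Console.WriteLine(g);
            Console.WriteLine("---");
        }
    }
}
```

Any line that begins with 001 starts a new group here, so this is only as reliable as the assumption that no other tag begins with those three characters.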

Answer 2

Well, you could use a regex; it would look something like this:

(?s)\b001\b.*?(?:\n\s*?\n|$)


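(?s): enables RegexOptions.Singleline (you can omit it if you pass that flag in the regex options instead)
\b001\b: 001 as a whole word; \b is a word boundary
.*?: anything (lazy match: takes as few characters as possible)
then either:
- \n\s*?\n: a newline followed by optional whitespace, then another newline
- $: or the end of the string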
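For reference, here is a minimal sketch of how the pattern above might be used from C#. The FindGroups helper and the sample text are made up for illustration. RegexOptions.Singleline stands in for the inline (?s), and Trim() drops the blank-line terminator that the pattern includes in each match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexGroups
{
    // Same pattern as above, with (?s) replaced by the Singleline option.
    static readonly Regex GroupPattern =
        new Regex(@"\b001\b.*?(?:\n\s*?\n|$)", RegexOptions.Singleline);

    // Hypothetical helper: returns one string per group.
    public static List<string> FindGroups(string text)
    {
        var groups = new List<string>();
        foreach (Match m in GroupPattern.Matches(text))
        {
            // Each match ends with the blank line that closed it; trim that off.
            groups.Add(m.Value.Trim());
        }
        return groups;
    }

    static void Main()
    {
        string text = "001 00 *a first\r\n004 00 *a e\r\n\r\n"
                    + "001 00 *a second\r\ne01 00 *a oops\r\n";
        foreach (string g in FindGroups(text))
        {
            Console.WriteLine(g);
            Console.WriteLine("---");
        }
    }
}
```

Because \b001\b is not tied to the start of a line, stray text ahead of the first record that contains 001 as a separate word would start a match early; that may or may not matter for the real files.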

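Phew. That regex is somewhat complex, but your code would probably be simpler if you processed the file line by line:

- Start a group when line.TrimStart().StartsWith("001 "), or when the line matches ^\s*001\b
- Stop when string.IsNullOrWhiteSpace(line) is true

Parsers are usually implemented as state machines, and this fits that scheme. It seems simpler to me, and as a bonus you don't have to load the whole thing into memory at once :)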
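The line-by-line idea could look roughly like this. ReadGroups is a hypothetical helper, not part of any library. It accepts any TextReader, so a StreamReader over the real file and a StringReader in a quick test both work, and only one group is held in memory at a time.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class GroupReader
{
    // Two states: outside a group (current == null) and inside one.
    public static IEnumerable<List<string>> ReadGroups(TextReader reader)
    {
        List<string> current = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (string.IsNullOrWhiteSpace(line))
            {
                // A blank line closes the group that is open, if any.
                if (current != null)
                {
                    yield return current;
                    current = null;
                }
            }
            else if (line.TrimStart().StartsWith("001 ", StringComparison.Ordinal))
            {
                // A new "001" line also closes a group that is still open.
                if (current != null) yield return current;
                current = new List<string> { line.Trim() };
            }
            else if (current != null)
            {
                current.Add(line.Trim());
            }
            // Non-blank lines seen before the first "001" are ignored.
        }
        // The last group may not be followed by a blank line.
        if (current != null) yield return current;
    }

    static void Main()
    {
        string text = "001 00 *a first\r\n004 00 *a e\r\n\r\n"
                    + "001 00 *a second\r\ne01 00 *a oops\r\n";
        using (var reader = new StringReader(text))
        {
            foreach (List<string> group in ReadGroups(reader))
            {
                Console.WriteLine(group.Count + " lines, first: " + group[0]);
            }
        }
    }
}
```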

Answer 3

You could go with:

^001           # look for 001 at the very beginning of a line
(?s:.+?)(?=^$) # turn on the single line mode
               # match anything lazily
               # make sure the positive lookahead (?=) matches an empty line

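Use this in multiline mode and, because of the inline comments, in verbose mode as well (RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace in C#). See a demo on regex101.com (and mind the different modifiers!)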
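One .NET-specific caveat, shown here as a hedged sketch rather than a drop-in fix: with RegexOptions.Multiline, $ matches only before \n, so in a CRLF file an "empty" line still holds a \r and ^$ does not match it. Allowing an optional \r, and falling back to \z (end of input) for a final group with no blank line after it, is one way around that. The FindGroups helper and the sample text are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MultilineGroups
{
    static readonly Regex GroupPattern = new Regex(
        @"^001          # 001 at the very beginning of a line
          (?s:.+?)      # anything, lazily, newlines included
          (?=^\r?$|\z)  # stop before an empty line (LF or CRLF) or at end of input",
        RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns one string per group, without trailing line breaks.
    public static List<string> FindGroups(string text)
    {
        var groups = new List<string>();
        foreach (Match m in GroupPattern.Matches(text))
        {
            groups.Add(m.Value.TrimEnd());
        }
        return groups;
    }

    static void Main()
    {
        // Plain ^$ misses the empty line in CRLF text; ^\r?$ finds it.
        Console.WriteLine(Regex.IsMatch("a\r\n\r\nb", "^$", RegexOptions.Multiline));
        Console.WriteLine(Regex.IsMatch("a\r\n\r\nb", @"^\r?$", RegexOptions.Multiline));

        string text = "001 00 *a first\r\n004 00 *a e\r\n\r\n"
                    + "001 00 *a second\r\ne01 00 *a oops\r\n";
        foreach (string g in FindGroups(text))
        {
            Console.WriteLine(g);
            Console.WriteLine("---");
        }
    }
}
```

Unlike the demo, this version requires 001 to sit in the first column; if the real lines are indented, ^[ \t]*001 would be the corresponding start.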

Answer 4

In C#, CR is \r and LF is \n.

If we use the double line ending as the delimiter, it works like this:

string[] delimiter = new string[] {"\r\n\r\n"};
string[] SplitString = s.Split(delimiter, StringSplitOptions.None);

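s is the original huge string. The result is an array of strings in which each element is one group starting with 001; the \r\n\r\n delimiter (the double line ending) that closed the group is not included in the element.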
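If it is not certain that every file uses CRLF, Split also accepts several delimiters at once. This is a small sketch, assuming the separator line is truly empty (no stray spaces on it); SplitOnBlankLines and the sample text are made-up names and data.

```csharp
using System;

static class BlankLineSplit
{
    // Hypothetical helper: splits on a double line ending, CRLF or LF.
    public static string[] SplitOnBlankLines(string s)
    {
        string[] delimiters = { "\r\n\r\n", "\n\n" };
        // RemoveEmptyEntries drops the empty strings that appear
        // when two delimiters sit directly next to each other.
        return s.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        string s = "001 00 *a first\r\n004 00 *a e\r\n\r\n"
                 + "001 00 *a second\n\n"
                 + "001 00 *a third";
        foreach (string group in SplitOnBlankLines(s))
        {
            Console.WriteLine(group.Trim());
            Console.WriteLine("---");
        }
    }
}
```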

Answer 5

Here is a very simple solution that works as long as the groups are separated by a blank line. I can provide a better solution if needed.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;


namespace ConsoleApplication85
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";
        static void Main(string[] args)
        {
            string inputLine;
            List<string> group = null;
            // "using" closes the file even if ProcessData throws
            using (StreamReader reader = new StreamReader(FILENAME))
            {
                while ((inputLine = reader.ReadLine()) != null)
                {
                    inputLine = inputLine.Trim();
                    if (inputLine.Length == 0)
                    {
                        // blank line: the current group (if any) is complete
                        if (group != null)
                        {
                            ProcessData(group);
                            group = null;
                        }
                    }
                    else
                    {
                        if (group == null)
                        {
                            group = new List<string>();
                        }
                        group.Add(inputLine);
                    }
                }
            }
            //process last group if there wasn't a blank line at end of group
            if (group != null) ProcessData(group);

        }
        static void ProcessData(List<string> data)
        {
            // placeholder: handle one group (one 001 record) here
        }

    }
}

Here is a solution for the case where there is no blank line between groups:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;


namespace ConsoleApplication85
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.txt";
        static void Main(string[] args)
        {
            string inputLine;
            List<string> group = null;
            // "using" closes the file even if ProcessData throws
            using (StreamReader reader = new StreamReader(FILENAME))
            {
                while ((inputLine = reader.ReadLine()) != null)
                {
                    inputLine = inputLine.Trim();
                    bool newGroup = inputLine.StartsWith("001 ");
                    if ((inputLine.Length == 0) || newGroup)
                    {
                        // a blank line or a new 001 line both end the current group
                        if (group != null)
                        {
                            ProcessData(group);
                            group = null;
                        }
                        if (newGroup)
                        {
                            group = new List<string>();
                            group.Add(inputLine);
                        }
                    }
                    else
                    {
                        if (group == null)
                        {
                            group = new List<string>();
                        }
                        group.Add(inputLine);
                    }
                }
            }
            //process last group if there wasn't a blank line at end of group
            if (group != null) ProcessData(group);

        }
        static void ProcessData(List<string> data)
        {
            // placeholder: handle one group (one 001 record) here
        }

    }
}
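ProcessData is left as a stub in both programs. Since the goal in the question is to deal with the errors in each group, one possible body is sketched here. Going only by the sample data, it assumes that error lines are the ones whose tag is e01 and that the 001 line is always the first line of a group; ErrorReport and GetErrors are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ErrorReport
{
    // Returns the lines of one group whose tag is "e01".
    public static List<string> GetErrors(List<string> group)
    {
        return group
            .Where(line => line.StartsWith("e01 ", StringComparison.Ordinal))
            .ToList();
    }

    // Possible replacement for the empty ProcessData stub.
    static void ProcessData(List<string> data)
    {
        List<string> errors = GetErrors(data);
        if (errors.Count == 0) return;

        // data[0] is the 001 line that identifies the record.
        Console.WriteLine("Record: " + data[0]);
        foreach (string error in errors)
        {
            Console.WriteLine("  " + error);
        }
    }

    static void Main()
    {
        var group = new List<string>
        {
            "001 00 *f a *a 0 014 017 1",
            "710 00 *a Odense Universitet *x se",
            "e01 00 *a illegal subfield x present in 710"
        };
        ProcessData(group);
    }
}
```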

Answer 6

A regex way to solve the problem:

        string body = @"001 00 *f a *a 0 014 017 1 *d 19740918 *c 19890127 *b 718500
004 00 *a e *r c
008 00 *t m *v 1 *u f *a 1974 *b dk *l dan
009 00 *a a *g xx
021 00 *a 87-7492-095-2 *d kr 69.00
032 00 *a DBR197075
041 00 *a dan *d eng
100 00 *a Kêœ³rsted *h Tage
245 00 *a Storbritannien og Danmark 1914-1920
260 00 *b Odense Universitetsforlag *c 1974
300 00 *a 240 sider *b ill
440 00 *a Odense University studies in history and social sciences *v 17
534 00 *a Med engelsk resumÃ©
652 00 *m 93.6
652 00 *Ã¥ 1 *p 96.7
710 00 *a Odense Universitet *x se *w Odense University studies in history and social sciences
e01 00 *a illegal subfield x present in 710
e01 00 *a illegal subfield w present in 710

001 00 *f a *a 0 014 573 4 *d 19741030 *c 19871230 *b 718500
004 00 *a e *r c
008 00 *t m *v 1 *u u *a 1974 *b dk *l dan
009 00 *a a *g xx
021 00 *a 87-422-7944-5 *d kr 49.00
032 00 *a DBR197075 *x NDF190104
245 00 *a De Â¤politiske partier *Ã¸ Ved Poul MÃ¸ller
250 00 *a Ny udgave *c redigeret af \Poul MÃ¸ller\ *d udgivet af Socialpolitisk Forening
260 00 *b Det Danske Forlag *c 1974
440 00 *a Nyt socialt bibliotek
652 00 *m 32.26
700 00 *a MÃ¸ller *h Poul *c f. 1919
710 00 *a Socialpolitisk Forening *x se *w Nyt socialt bibliotek
e01 00 *a illegal subfield x present in 710
e01 00 *a illegal subfield w present in 710

001 00 *f a *a 0 014 691 9 *d 19741030 *c 19871018 *b 718500
004 00 *a e *r c
008 00 *t m *v 1 *u f *a 1973 *b dk *l dan
009 00 *a a *g xx
021 00 *a 87-7472-020-1 *d kr 27.60
032 00 *a DBR197075
110 00 *a Landsarkivet for SjÃ¦lland m. m
245 00 *a Oversigt over Landsarkivets samling af kort og tegninger indtil ca. 1900
260 00 *c 1973
300 00 *a 6, 207 sider
440 00 *a ForelÃ¸bige arkivregistraturer / udgivet af Landsarkivet for SjÃ¦lland m. m.
652 00 *m 02.91
710 00 *a Landsarkivet for SjÃ¦lland m. m. *x se *w ForelÃ¸bige arkivregistraturer / udgivet af Landsarkivet for SjÃ¦lland m. m.
e01 00 *a illegal subfield x present in 710
e01 00 *a illegal subfield w present in 710";

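        // needs: using System.Linq; using System.Text.RegularExpressions;
        // Each match runs from a "001" tag up to, but not including, the line break
        // in front of the next line that starts with "001" (or to the end of the text).
        var pattern = @"(\b001\b[\s\S]+?)(?=\r?\n\s*001\b|\s*\z)";
        var result1 = Regex.Matches(body, pattern).Cast<Match>().Select(m => m.Groups[1].Value).ToList();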

Trying to capture repeating text blocks using regex (in C#)
