Regex to match sections and their data

Time: 2016-03-14 15:20:29

Tags: c# regex

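I have a huge file made up of different sections (16 of them); I have pasted a small part of it below. I am trying to split the file into these sections, but I can't seem to capture the "data" as a group.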

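My regex pattern for splitting the sections is the following (at first it seemed to capture only the first section; I now get all the sections, just without the data):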

(?<Section>^Section.*$)

When I try to add a part to the regex for the data, it still only matches the section line (according to regexstorm.net):

^((?<Section>^Section.*$)
(?<Data>[ ]*))

The file I am trying to split looks like this:

Section 1 Who - All accounts: 1 record per account
Account number,Group name,Subgroup name,Customer name,Account ref1,Account ref2,Invoice name,Invoice number,Invoice date,Recurring amount,Occ,Commercial Credits,Discounts,Total excl. VAT,VAT,Total incl. VAT,Currency
"00001","Unfiled - Niet toegekend - Non attribué","","Reference1","gemeentesecretariaat","10408","Reference11","160600501092","11/FEB/2016",0.0,0.0,0.0,0.0,0.0,0.0,0.0,"EUR",
"00002","Unfiled - Niet toegekend - Non attribué","","Reference2","receptieve ruimten","76005","Reference21","160600433432","11/FEB/2016",0.0,-5.8393,0.0,4.4985,0.0,0.0,0.0,"EUR",

Section 14 Who - All subscribers - Data Volume: 1 record per Data session
GSM number,Group name,Subgroup name,Name GSM number,User ref1,User Ref2,Call date,Call time,Total volume (MB),Service,Zone/Country/Operator,Tariff,Type,Supplementary services,Usage amount,Currency,Account number
"0XXX/XXXXXX","Departement 1","Unfiled - Niet toegekend - Non attribué","Familyname","3000000","14","17/JAN/2016","14:42:12","0.1470","Mobile Internet","","Daluur","GPRS nationaal","",0.0,"EUR","25000000",
"0XXX/XXXXXX","Departement 1","Unfiled - Niet toegekend - Non attribué","Familyname","3000000","14","31/JAN/2016","19:55:08","0.3110","Mobile Internet","","Daluur","GPRS nationaal","",0.0,"EUR","25000000",

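So the goal is to capture each "Section" line as one group and all the data that follows it as a Data group. I need to split the file first so that I can parse each section separately, since the file cannot be parsed as a whole anyway :)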

The code I use to split the file is as follows:

public static void ReadFromSectionedCsv(this DataSet dataset, string filepath)
    {
        const string PATTERN = @"
 ^((?<Section>^Section.*$)
 (?<Data>[ ]*))";
        dataset.Clear();
        using (Stream filestream = File.Open(filepath, FileMode.Open))
        {
            Encoding encoding = Encoding.UTF8;
            string fileContetnt;
            using (StreamReader sr = new StreamReader(filestream, encoding))
            {
                fileContetnt = sr.ReadToEnd();
            }

            var match = Regex.Matches(fileContetnt, PATTERN,
                RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
            foreach (Match m in match)
            {
                var sectionmatch = m.Groups["Section"];
                var datamatch = m.Groups["Data"];

                using (MemoryStream stream = new MemoryStream())
                using (StreamWriter writer = new StreamWriter(stream))
                {
                    writer.Write(datamatch.Value);
                    writer.Flush();
                    stream.Position = 0;
                    dataset.Tables.Add(sectionmatch.Value).ReadFromCsv(stream);
                }
            }
        }
    }

Thanks in advance for any help!

3 Answers:

Answer 0: (score: 0)

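You can use a lookahead ((?=^Section|\Z)) to stop at the next Section or at the end of the string, combined with non-greedy consumption of the data ((.|\n)*?). Note that ^ only matches at the start of each line when RegexOptions.Multiline is set.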

(?<MySection>^Section(.)+)(?<MyData>(.|\n)*?)(?=^Section|\Z)
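
As a rough illustration of how this pattern could be applied, here is a minimal sketch. The class name SectionSplitter and its Split method are hypothetical helpers, not part of the question's code, and the sketch assumes that no data line starts with the word "Section".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns one (header, data) pair per section.
public static class SectionSplitter
{
    // Pattern from the answer above. Multiline makes ^ match at every line start.
    private static readonly Regex SectionPattern = new Regex(
        @"(?<MySection>^Section(.)+)(?<MyData>(.|\n)*?)(?=^Section|\Z)",
        RegexOptions.Multiline);

    public static List<KeyValuePair<string, string>> Split(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in SectionPattern.Matches(text))
        {
            // Trim() drops the surrounding line breaks (and a trailing \r on CRLF files).
            string header = m.Groups["MySection"].Value.Trim();
            string data = m.Groups["MyData"].Value.Trim();
            result.Add(new KeyValuePair<string, string>(header, data));
        }
        return result;
    }
}
```

Calling SectionSplitter.Split(fileContetnt) in place of the Regex.Matches loop from the question should give one header/data pair per section, which could then be passed on to the table-loading code.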

Answer 1: (score: 0)

Well, you have two options.

  

If your file is large:

Regex regexObj = new Regex(@"^.*(?=(\r?\n)\1)|(?<=(\r?\n)).*",
RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline);

Match matchResults = regexObj.Match(text);
while (matchResults.Success) {
    var section = matchResults.Value;
    matchResults = matchResults.NextMatch();
} 

Just keep iterating over the matches and process each one accordingly.

  

If the file is of a reasonable size:

string[] splitArray = null;
try {
    splitArray = Regex.Split(text, @"^\s*$", 
    RegexOptions.IgnoreCase | RegexOptions.Multiline);
} catch (ArgumentException ex) {
    // Syntax error in the regular expression
}

This splits the sections into an array.
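
To show what the resulting array might look like in practice, here is a minimal sketch of the blank-line split followed by separating each chunk's first line (the section header) from the rest (the data). The class name BlankLineSplitter is a hypothetical helper, and the sketch assumes that sections are separated by blank lines, as in the sample file. The split can produce empty or whitespace-only pieces, so these are skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: splits on blank lines, then separates header from data.
public static class BlankLineSplitter
{
    public static List<KeyValuePair<string, string>> Split(string text)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (string raw in Regex.Split(text, @"^\s*$", RegexOptions.Multiline))
        {
            string chunk = raw.Trim();
            if (chunk.Length == 0)
            {
                continue; // skip the empty pieces the split can produce
            }

            // The first line of a chunk is the section header; the rest is its data.
            int eol = chunk.IndexOf('\n');
            string header = eol < 0 ? chunk : chunk.Substring(0, eol).Trim();
            string data = eol < 0 ? string.Empty : chunk.Substring(eol + 1).Trim();
            result.Add(new KeyValuePair<string, string>(header, data));
        }
        return result;
    }
}
```

This variant does not rely on the word "Section" at all, so it would break if a section contained a blank line inside its data.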

Answer 2: (score: 0)

Try this one

(Section(?:.|\n)*?(?=Section|$))+
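
Because the outer group is repeated with +, this pattern tends to produce a single match that spans all sections, with each section available as a separate capture of group 1. The sketch below reads those captures; the class name CaptureSplitter is a hypothetical helper, and the sketch assumes the word "Section" never appears inside the data, since the pattern is not anchored to line starts.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns each section (header plus data) as one string.
public static class CaptureSplitter
{
    public static List<string> Split(string text)
    {
        var chunks = new List<string>();
        Match m = Regex.Match(text, @"(Section(?:.|\n)*?(?=Section|$))+");
        if (!m.Success)
        {
            return chunks;
        }

        // Group 1 is repeated, so every iteration is kept in its Captures collection.
        foreach (Capture c in m.Groups[1].Captures)
        {
            chunks.Add(c.Value.Trim());
        }
        return chunks;
    }
}
```

Each returned string still contains the header line followed by its data, so the header would have to be split off afterwards.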