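I have a huge file containing different sections (16), of which I have pasted a small part. I am trying to split it into sections, but I can't seem to capture the "data" of each section as a group.
My regex pattern for splitting the sections is the following (at first it seemed to capture only the first section; I now get all sections, just without their data):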
(?<Section>^Section.*$)
When I try to add a part to the regex for the data, it still matches only the section line (according to regexstorm.net):
^((?<Section>^Section.*$)
(?<Data>[ ]*))
The file I am trying to split looks like this:
Section 1 Who - All accounts: 1 record per account
Account number,Group name,Subgroup name,Customer name,Account ref1,Account ref2,Invoice name,Invoice number,Invoice date,Recurring amount,Occ,Commercial Credits,Discounts,Total excl. VAT,VAT,Total incl. VAT,Currency
"00001","Unfiled - Niet toegekend - Non attribué","","Reference1","gemeentesecretariaat","10408","Reference11","160600501092","11/FEB/2016",0.0,0.0,0.0,0.0,0.0,0.0,0.0,"EUR",
"00002","Unfiled - Niet toegekend - Non attribué","","Reference2","receptieve ruimten","76005","Reference21","160600433432","11/FEB/2016",0.0,-5.8393,0.0,4.4985,0.0,0.0,0.0,"EUR",
Section 14 Who - All subscribers - Data Volume: 1 record per Data session
GSM number,Group name,Subgroup name,Name GSM number,User ref1,User Ref2,Call date,Call time,Total volume (MB),Service,Zone/Country/Operator,Tariff,Type,Supplementary services,Usage amount,Currency,Account number
"0XXX/XXXXXX","Departement 1","Unfiled - Niet toegekend - Non attribué","Familyname","3000000","14","17/JAN/2016","14:42:12","0.1470","Mobile Internet","","Daluur","GPRS nationaal","",0.0,"EUR","25000000",
"0XXX/XXXXXX","Departement 1","Unfiled - Niet toegekend - Non attribué","Familyname","3000000","14","31/JAN/2016","19:55:08","0.3110","Mobile Internet","","Daluur","GPRS nationaal","",0.0,"EUR","25000000",
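So the goal is to capture every "Section" line as one group, with all the data that follows it as the Data group. To parse each section separately I first need to split the file, since parsing the whole file is not possible without splitting it first :)
The code I use to split the file is the following: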
public static void ReadFromSectionedCsv(this DataSet dataset, string filepath)
{
const string PATTERN = @"
^((?<Section>^Section.*$)
(?<Data>[ ]*))";
dataset.Clear();
using (Stream filestream = File.Open(filepath, FileMode.Open))
{
Encoding encoding = Encoding.UTF8;
string fileContetnt;
using (StreamReader sr = new StreamReader(filestream, encoding))
{
fileContetnt = sr.ReadToEnd();
}
var match = Regex.Matches(fileContetnt, PATTERN,
RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
foreach (Match m in match)
{
var sectionmatch = m.Groups["Section"];
var datamatch = m.Groups["Data"];
using (MemoryStream stream = new MemoryStream())
using (StreamWriter writer = new StreamWriter(stream))
{
writer.Write(datamatch.Value);
writer.Flush();
stream.Position = 0;
dataset.Tables.Add(sectionmatch.Value).ReadFromCsv(stream);
}
}
}
}
Thanks in advance for any help!
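Answer 0 (score: 0)
You can use a lookahead expression ((?=^Section|\Z)) to match the next Section or the end of the string, together with a non-greedy consumption of the Data ((.|\n)*?). Note that ^ only matches at the start of each line when RegexOptions.Multiline is set.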
(?<MySection>^Section(.)+)(?<MyData>(.|\n)*?)(?=^Section|\Z)
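As a minimal sketch of how the pattern above might be applied, the helper below runs it with RegexOptions.Multiline and returns one header/data pair per section. The class name SectionSplitter and the method Split are hypothetical names chosen for illustration, and trimming the leading and trailing line breaks from the data is an assumption about what the caller wants, not something the answer specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SectionSplitter
{
    // Pattern taken from the answer above. RegexOptions.Multiline is assumed,
    // so that ^ matches at the start of every line, not only at the start of the input.
    private const string Pattern =
        @"(?<MySection>^Section(.)+)(?<MyData>(.|\n)*?)(?=^Section|\Z)";

    public static List<KeyValuePair<string, string>> Split(string content)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Regex.Matches(content, Pattern, RegexOptions.Multiline))
        {
            // In .NET, '.' also matches '\r', so strip it from headers of CRLF files.
            string header = m.Groups["MySection"].Value.TrimEnd('\r');
            // Assumption: the line breaks surrounding the data block are not wanted.
            string data = m.Groups["MyData"].Value.Trim('\r', '\n');
            result.Add(new KeyValuePair<string, string>(header, data));
        }
        return result;
    }
}
```

Each returned pair could then be passed to the existing ReadFromCsv call, using the key as the table name and the value as the CSV content. On a very large file the (.|\n)*? construct backtracks one character at a time, so this may be slow.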
Answer 1 (score: 0)
Well, you have two options.
If your file is large:
Regex regexObj = new Regex(@"^.*(?=(\r?\n)\1)|(?<=(\r?\n)).*",
RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline);
Match matchResults = regexObj.Match(text);
while (matchResults.Success) {
var section = matchResults.Value;
matchResults = matchResults.NextMatch();
}
Just keep iterating over the matches and process each one accordingly.
If it is of a reasonable size:
string[] splitArray = null;
try {
splitArray = Regex.Split(text, @"^\s*$",
RegexOptions.IgnoreCase | RegexOptions.Multiline);
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
This splits the sections into an array.
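Splitting on ^\s*$ relies on blank lines between the sections, and the sample in the question does not show any. As a hedged alternative, the sketch below splits at the start of every line that begins with "Section", using a zero-width lookahead so that the header line stays at the top of its chunk. The names SectionChunker and SplitAtHeaders are hypothetical, and the approach assumes that no data line starts with the word "Section".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SectionChunker
{
    public static string[] SplitAtHeaders(string content)
    {
        // ^(?=Section\s) is zero-width: it marks the position in front of each
        // header line without consuming it. Multiline makes ^ match per line.
        return Regex.Split(content, @"^(?=Section\s)", RegexOptions.Multiline)
                    // Drop any empty chunk, e.g. one produced in front of the first header.
                    .Where(chunk => chunk.Length > 0)
                    .ToArray();
    }
}
```

Each element of the resulting array starts with its "Section ..." line followed by that section's CSV rows, so the first line can be peeled off as the table name.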
Answer 2 (score: 0)
(Section(?:.|\n)*?(?=Section|$))+
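Because the outer group in this pattern is repeated with +, a single match can span several sections at once. In .NET the individual sections can then be read from the group's Captures collection. The sketch below illustrates this under the assumption that the word "Section" never occurs inside the data, since the pattern is not anchored to line starts; the names SectionCaptures and Extract are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class SectionCaptures
{
    public static List<string> Extract(string content)
    {
        var sections = new List<string>();
        // Pattern taken from the answer above, used without extra options.
        Match m = Regex.Match(content, @"(Section(?:.|\n)*?(?=Section|$))+");
        if (m.Success)
        {
            // Group 1 is repeated by '+', so every iteration is kept as a Capture.
            foreach (Capture c in m.Groups[1].Captures)
            {
                sections.Add(c.Value);
            }
        }
        return sections;
    }
}
```

Each capture contains the header line and its data together, so the header still has to be separated from the rows afterwards.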