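I have to write (or use an existing) CSV parsing library.
The problem is that the files are uploaded in different formats, with different separator characters, for example: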
File1:
field1; field2; field3; field4
field1; field2; field3; field4
File2:
feld1, field2, field3, field4
feld1, field2, field3, field4
File3:
"field1", "field2", "field3", "field4"
"field1", "field2", "field3", "field4"
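What is the best way to determine programmatically which symbol is the actual column separator?
I am thinking about writing my own method based on statistical analysis of the symbols, but maybe there is an existing solution?
Answer 0 (score: 1)
I would use a regular expression (hoping not to collect as many downvotes as last time ;)). I am taking advantage of backreferences, which basically let you reuse a previously captured group. As long as each line uses the same separator throughout, you can even use different separators within the same file (I don't know whether that is useful).
So, this is how I built the regular expression: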
string csvItem = @"[""']?\w+[""']?";
string separator = @"\s*[,\.;-]\s*";
string pattern = string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$",
csvItem, separator);
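csvItem is an item (column) in the CSV. It may contain lowercase or uppercase letters, digits and underscores, and may optionally be surrounded by " or '.
separator separates the items. It consists of one of the characters , . ; - surrounded by zero or more whitespace characters.
pattern states that a valid line consists of at least two csvItems separated by a separator. Note the backreference \k<sep>, which matches the same text that the sep group captured earlier in the line.
Here is the content of the test file: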
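To make the effect of the backreference concrete, here is a minimal sketch that wraps the same pattern in a helper and tries it on three short inputs. The class name BackreferenceDemo, the helper DetectSeparator and the sample strings are assumptions made for illustration; they are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class BackreferenceDemo {
    // Same building blocks as in the answer.
    const string CsvItem = @"[""']?\w+[""']?";
    const string Separator = @"\s*[,\.;-]\s*";

    static readonly Regex Line = new Regex(
        string.Format(@"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$", CsvItem, Separator));

    // Hypothetical helper: returns the captured separator,
    // or null when the line does not match the pattern.
    public static string DetectSeparator(string line) {
        Match m = Line.Match(line);
        return m.Success ? m.Groups["sep"].Value : null;
    }

    static void Main() {
        // The captured separator includes the surrounding whitespace.
        Console.WriteLine("'{0}'", DetectSeparator("a; b; c"));   // '; '
        // The second separator differs from the captured one, so no match.
        Console.WriteLine(DetectSeparator("a; b, c") == null);    // True
        // An empty line contains no item, so no match.
        Console.WriteLine(DetectSeparator("") == null);           // True
    }
}
```

Note that the captured value is the separator together with its whitespace, so a line that mixes "; " and ";" is also rejected.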
field1; field2; field3; field4
field1; field2; field3; field4

feld1, field2, field3, field4
feld1, field2, field3, field4

"field1", "field2", "field3", "field4"
"field1", "field2", "field3", "field4"
Sample console project:
using System;
using System.IO;
using System.Text.RegularExpressions;

namespace csvParser {
    class Program {
        static void Main( string[ ] args ) {
            string csvItem = @"[""']?\w+[""']?";
            string separator = @"\s*[,\.;-]\s*";
            string pattern = string.Format( @"^({0}(?<sep>{1}){0})+(\k<sep>{0})*$", csvItem, separator );
            // The pattern does not depend on the line, so build the regex once.
            var rex = new Regex( pattern );

            var lines = File.ReadAllLines( @"e:\prova.csv" );
            for ( int i = 0; i < lines.Length; i++ ) {
                var match = rex.Match( lines[ i ] );
                // Regex.Match never returns null; a failed match has Success == false.
                if ( !match.Success ) {
                    Console.WriteLine( "No match on line {0}", i );
                    continue;
                }
                // The captured separator includes its surrounding whitespace, e.g. "; ".
                string sep = match.Groups[ "sep" ].Value;
                Console.WriteLine( "--- Line #{0} ---------------", i );
                Console.WriteLine( "Line is '{0}'", lines[ i ] );
                Console.WriteLine( "Separator is '{0}'", sep );
                Console.WriteLine( "Items are:" );
                foreach ( string item in lines[ i ].Split( sep ) )
                    Console.WriteLine( "\t'{0}'", item );
                Console.WriteLine( );
            }
            Console.ReadKey( );
        }
    }

    public static partial class Extension {
        // Splits on a whole string separator and drops empty entries.
        public static string[ ] Split( this string str, string sep ) {
            return str.Split( new string[ ] { sep }, StringSplitOptions.RemoveEmptyEntries );
        }
    }
}
And finally the output:
--- Line #0 ---------------
Line is 'field1; field2; field3; field4'
Separator is '; '
Items are:
    'field1'
    'field2'
    'field3'
    'field4'

--- Line #1 ---------------
Line is 'field1; field2; field3; field4'
Separator is '; '
Items are:
    'field1'
    'field2'
    'field3'
    'field4'

--- Line #2 ---------------
Line is ''
Separator is ''
Items are:

--- Line #3 ---------------
Line is 'feld1, field2, field3, field4'
Separator is ', '
Items are:
    'feld1'
    'field2'
    'field3'
    'field4'

--- Line #4 ---------------
Line is 'feld1, field2, field3, field4'
Separator is ', '
Items are:
    'feld1'
    'field2'
    'field3'
    'field4'

--- Line #5 ---------------
Line is ''
Separator is ''
Items are:

--- Line #6 ---------------
Line is '"field1", "field2", "field3", "field4"'
Separator is ', '
Items are:
    '"field1"'
    '"field2"'
    '"field3"'
    '"field4"'

--- Line #7 ---------------
Line is '"field1", "field2", "field3", "field4"'
Separator is ', '
Items are:
    '"field1"'
    '"field2"'
    '"field3"'
    '"field4"'
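Unfortunately, the output also lists the empty lines (#2 and #5). The regular expression itself does not match them. The output above comes from a run that tested the result with match == null; since Regex.Match never returns null, the empty lines fell through to the else branch with an empty separator. Testing match.Success reports them as "No match" instead.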
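For comparison, the statistical approach mentioned in the question can be sketched in a few lines: count each candidate character on every non-blank line and keep the one whose count is identical and non-zero on all of them. This is only a rough sketch under stated assumptions. The class name DelimiterSniffer, the method Detect and the candidate set are hypothetical, and the counting ignores the case of a delimiter appearing inside a quoted field.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DelimiterSniffer {
    // Assumed candidate set; adjust it to the files you expect.
    static readonly char[] Candidates = { ',', ';', '\t', '|' };

    // Returns the candidate that occurs the same non-zero number of times
    // on every non-blank line, or null when no candidate qualifies.
    // If several qualify, the one with the highest per-line count wins.
    public static char? Detect(IEnumerable<string> lines) {
        var rows = lines.Where(l => !string.IsNullOrWhiteSpace(l)).ToList();
        if (rows.Count == 0) return null;

        char? best = null;
        int bestCount = 0;
        foreach (char c in Candidates) {
            int first = rows[0].Count(ch => ch == c);
            bool consistent = first > 0 && rows.All(r => r.Count(ch => ch == c) == first);
            if (consistent && first > bestCount) {
                best = c;
                bestCount = first;
            }
        }
        return best;
    }

    static void Main() {
        var file1 = new[] { "field1; field2; field3; field4", "", "field1; field2; field3; field4" };
        var file3 = new[] { "\"field1\", \"field2\"", "\"field1\", \"field2\"" };
        Console.WriteLine(Detect(file1));  // ;
        Console.WriteLine(Detect(file3));  // ,
    }
}
```

Unlike the regular expression, this looks at the file as a whole, so it assumes one separator per file and skips blank lines up front.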