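I'm trying to parse CSV files in which many fields contain both double quotes and commas. I have no control over the CSV format, and instead of using "" to escape quotes it uses \". The files are also very large, so reading the whole file in and running a regular expression over it is not the best option for me.
I would prefer to use an existing library rather than write an entirely new parser. Currently I'm using CsvHelper.
Here is a sample of the CSV data: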
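    "ID","Name","Notes"
    "40","Continue","If the message \"Continue\" does not appear restart, and notify your instructor."
    "41","Restart","If the message \"Restart\" does not appear after 10 seconds, manually restart."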
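The problem is that the escaped double quotes are not recognized as escaped. They are read as field delimiters, which splits the notes field into two separate fields.
This is my current code, which does not work.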
    DataTable csvData = new DataTable();
    string csvFilePath = @"C:\Users\" + csvFileName + ".csv";
    try
    {
        FileInfo file = new FileInfo(csvFilePath);
        using (TextReader reader = file.OpenText())
        using (CsvReader csv = new CsvReader(reader))
        {
            csv.Configuration.Delimiter = ",";
            csv.Configuration.HasHeaderRecord = true;
            csv.Configuration.IgnoreQuotes = false;
            csv.Configuration.TrimFields = true;
            csv.Configuration.WillThrowOnMissingField = false;
            string[] colFields = null;
            while (csv.Read())
            {
                // Create the columns from the header on the first record.
                if (colFields == null)
                {
                    colFields = csv.FieldHeaders;
                    foreach (string column in colFields)
                    {
                        DataColumn datacolumn = new DataColumn(column);
                        datacolumn.AllowDBNull = true;
                        csvData.Columns.Add(datacolumn);
                    }
                }
                // Store empty strings as null.
                string[] fieldData = csv.CurrentRecord;
                for (int i = 0; i < fieldData.Length; i++)
                {
                    if (fieldData[i] == "")
                    {
                        fieldData[i] = null;
                    }
                }
                csvData.Rows.Add(fieldData);
            }
        }
    }
    catch (Exception)
    {
        // Placeholder: the error handling was not shown in the post.
        throw;
    }
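Is there an existing library that lets you specify how quotes are escaped, or should I write my own parser?
Answer 0 (score: 2)
You can get pretty far with a very simple LINQ statement that uses Split and Trim, and finally Replace to unescape the quotes in the content: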
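    // Requires: using System; using System.Data; using System.IO; using System.Linq;
    DataTable csvData = new DataTable();
    string csvFilePath = @"C:\Users\" + csvFileName + ".csv";
    try
    {
        // A field boundary is a quote followed by a comma, or a comma followed by a quote.
        string[] seps = { "\",", ",\"" };
        char[] quotes = { '\"', ' ' };
        string[] colFields = null;
        foreach (var line in File.ReadLines(csvFilePath))
        {
            var fields = line
                .Split(seps, StringSplitOptions.None)
                .Select(s => s.Trim(quotes).Replace("\\\"", "\""))
                .ToArray();
            if (colFields == null)
            {
                // The first line is the header: create one column per field.
                colFields = fields;
                foreach (string column in colFields)
                {
                    DataColumn datacolumn = new DataColumn(column);
                    datacolumn.AllowDBNull = true;
                    csvData.Columns.Add(datacolumn);
                }
            }
            else
            {
                // Store empty strings as null.
                for (int i = 0; i < fields.Length; i++)
                {
                    if (fields[i] == "")
                    {
                        fields[i] = null;
                    }
                }
                csvData.Rows.Add(fields);
            }
        }
    }
    catch (Exception)
    {
        // Placeholder: the error handling was not shown in the post.
        throw;
    }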
Used in a very simple console application, with the OP's original input in a "test.txt" file:
    public static void CsvUnescapeSplit()
    {
        string[] seps = { "\",", ",\"" };
        char[] quotes = { '\"', ' ' };
        foreach (var line in File.ReadLines(@"c:\temp\test.txt"))
        {
            var fields = line
                .Split(seps, StringSplitOptions.None)
                .Select(s => s.Trim(quotes).Replace("\\\"", "\""))
                .ToArray();
            foreach (var field in fields)
                Console.Write("{0} | ", field);
            Console.WriteLine();
        }
    }
This produces the following (correct) output:
    id | name | notes |
    40 | Continue | If the message "Continue" does not appear restart, and notify your instructor. |
    41 | Help | If the message "Restart" does not appear after 10 seconds, manually restart. |
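Caveat: if your field separators contain whitespace, like this:
    "40" , "Continue" , "If the message \"Continue\" does not appear restart, and notify your instructor."
or if a content string contains a comma directly after a quote, like here (after "Restart"):
    "41","Help","If the message \"Restart\", does not appear after 10 seconds, manually restart."
it will fail, because Split treats every quote-comma or comma-quote pair as a field boundary, whether or not it lies inside a field.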
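Both failure cases can be avoided by scanning each line character by character and tracking whether the current position is inside a quoted field. The following is only a minimal sketch, not tested against the OP's real files: `BackslashCsv` and `SplitLine` are hypothetical names, and it assumes that every field is quoted, that `\"` is the only escape sequence, and that no field spans more than one line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BackslashCsv
{
    // Hypothetical helper: splits one CSV line whose fields are wrapped in
    // double quotes and whose embedded quotes are escaped as \".
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '\\' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"');   // \" becomes a literal quote
                    i++;                   // skip the escaped quote
                }
                else if (c == '"')
                {
                    inQuotes = false;      // unescaped quote closes the field
                }
                else
                {
                    current.Append(c);     // commas inside quotes are data
                }
            }
            else if (c == '"')
            {
                inQuotes = true;           // opening quote
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());   // field boundary
                current.Clear();
            }
            else if (!char.IsWhiteSpace(c))
            {
                current.Append(c);         // unquoted content (assumed rare)
            }
            // Whitespace outside quotes is dropped, so separators such as
            // " , " are tolerated; unquoted fields would lose their spaces.
        }
        fields.Add(current.ToString());    // last field on the line
        return fields;
    }
}
```

To use it with the loop above, replace the Split/Select/ToArray chain with `BackslashCsv.SplitLine(line).ToArray()`. It still reads one line at a time through File.ReadLines, so memory use stays flat for large files, but a field that contains a line break would need a reader that carries the inside-quotes state across lines.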