How can I handle malformed CSV data when parsing?

Time: 2016-08-29 19:55:56

Tags: c# csv parsing malformed textfieldparser

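I know the data should be correct. I have no control over the data, and my boss has simply told me that I need to find a way to deal with someone else's mistake. So please don't tell me that bad data isn't my problem, because it is.

Anywho, this is what I'm looking at: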

"Words","email@email.com","","4253","57574","FirstName","","LastName, MD","","","576JFJD","","1971","","Words","Address","SUITE "A"","City","State","Zip","Phone","",""

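The data has been scrubbed for confidentiality reasons.

As you can see, the data is enclosed in quotes, and some of the quoted fields contain commas, so I can't simply strip the quotes. But the "SUITE "A"" field is throwing off the parser. There are too many quotes. >.<

I'm using TextFieldParser from the Microsoft.VisualBasic.FileIO namespace with these settings: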

            parser.HasFieldsEnclosedInQuotes = true;
            parser.SetDelimiters(",");
            parser.TextFieldType = FieldType.Delimited;

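The error is

MalformedLineException: Line 9871 cannot be parsed using the current delimiters.

I'd like to scrub the data somehow to fix this, but I'm not sure how to go about it. Or maybe there's a way to just skip this line? Though I doubt my higher-ups would approve of skipping data we might need.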

6 answers:

Answer 0: (score: 2)

I'm not familiar with TextFieldParser. But with CsvHelper, you can add a custom handler for invalid data:

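// needs: using System.IO; using System.Linq; using CsvHelper; using CsvHelper.Configuration;
var config = new CsvConfiguration();
config.IgnoreReadingExceptions = true;
config.ReadingExceptionCallback += (e, row) =>
{
    // you can add some custom patching here if possible
    // or, save the line numbers and add/edit them manually later.
};

// CsvReader wraps a TextReader, so open the file as text and pass that in
using (var file = File.OpenText(".csv"))
using (var reader = new CsvReader(file, config))
{
    // GetRecords is lazy, so enumerate it to actually read the file
    var records = reader.GetRecords<YourDtoClass>().ToList();
}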

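Answer 1: (score: 2)

If you just want to get rid of the stray " marks in the CSV, you can find them with the following regular expression and replace them with '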

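// needs: using System; using System.Text.RegularExpressions;
String sourcestring = "source string to match with pattern";
String matchpattern = @"(?<!^|,)""(?!(,|$))";
// the only group sits inside a negative lookahead and never captures, so a plain ' is enough
String replacementpattern = "'";
Console.WriteLine(Regex.Replace(sourcestring, matchpattern, replacementpattern, RegexOptions.Multiline));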

Explanation:

@"(?<!^|,)""(?!(,|$))"; matches any " that is not preceded by the start of the string or a , and is not followed by the end of the string or a ,
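To tie this back to the question, here is a minimal sketch that cleans a line with the same pattern and then hands it to TextFieldParser with the asker's settings. StrayQuoteCleaner, Clean and ParseLine are made-up names for illustration. One assumption to check against your data: with RegexOptions.Multiline, $ matches only before '\n', so on files with "\r\n" line endings the last quote of each line would also be replaced unless the '\r' is stripped first.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using Microsoft.VisualBasic.FileIO;

public static class StrayQuoteCleaner
{
    // Same pattern as in the answer: a quote that is neither right after a
    // line start or comma, nor right before a comma or line end.
    private static readonly Regex StrayQuote =
        new Regex(@"(?<!^|,)""(?!(,|$))", RegexOptions.Multiline);

    // Replaces every stray double quote with a single quote.
    public static string Clean(string csvText)
    {
        return StrayQuote.Replace(csvText, "'");
    }

    // Cleans one line, then parses it with the settings from the question.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(Clean(line))))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            return parser.ReadFields();
        }
    }
}
```

Note that this changes the data (SUITE "A" becomes SUITE 'A'), which may or may not be acceptable downstream.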

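Answer 2: (score: 1)

I've had to do this before.

The first step is to parse the data using string.Split(',').

The next step is to merge the segments that belong together.

What I basically did was:

  • Make a new list representing the combined strings
  • If a string starts with a quote, push it onto the new list
  • If it doesn't start with a quote, append it to the last string in the list
  • Bonus: throw an exception when a string ends with a quote but the next string doesn't start with a quote

Depending on the rules about what the data actually looks like, you may need to change the code to make this work.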
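The steps above could be sketched roughly as follows. This is only an illustration under the assumption that every field in the file is quoted, as in the question's sample; QuoteMerge and SplitLine are hypothetical names, and the pieces are returned with their surrounding quotes still attached.

```csharp
using System;
using System.Collections.Generic;

public static class QuoteMerge
{
    // Splits on every comma, then glues back together the pieces that
    // belong to the same quoted field.
    public static List<string> SplitLine(string line)
    {
        var merged = new List<string>();
        foreach (string piece in line.Split(','))
        {
            bool startsWithQuote = piece.Length > 0 && piece[0] == '"';
            if (merged.Count == 0 || startsWithQuote)
            {
                // A piece that starts with a quote begins a new field.
                merged.Add(piece);
                continue;
            }
            string last = merged[merged.Count - 1];
            // Bonus rule: the previous field looks closed, but this piece
            // does not open a new one.
            if (last.Length > 1 && last.EndsWith("\"", StringComparison.Ordinal))
                throw new FormatException("Unquoted text after a closed field: " + piece);
            // Otherwise the comma was part of the data: restore it.
            merged[merged.Count - 1] = last + "," + piece;
        }
        return merged;
    }
}
```

On the sample line this keeps "LastName, MD" and "SUITE "A"" as single fields. A field that contains both a comma and an embedded quote right before it would still trip the bonus rule, so the rules may need adjusting to the real data.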

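Answer 3: (score: 1)

At the core of the CSV file format, each line is a row, and each cell in that row is separated by a comma. In your case, the format also includes the (very unfortunate) provision that a comma inside a pair of quotes does not count as a separator but is part of the data. I say very unfortunate because a misplaced quote affects the rest of the entire line, and since quotes in standard ASCII don't distinguish between opening and closing, you can't recover from it without knowing the original intent.

If you log the message somewhere, someone who does know the original intent (the person who provided the data) can look at the file and correct the error: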

if (parse_line(line, &data)) {
   // save the data
} else {
   // log the error (stderr is already a FILE*, so no address-of)
   fprintf(stderr, "Bad line: %s\n", line);
}

Since your quotes don't escape newlines, you can carry on with the next line after hitting this error.
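In C#, with the TextFieldParser settings from the question, the log-and-continue idea might look like the sketch below. BadLineSkipper, ReadRows and the badLines list are hypothetical names; ErrorLine and ErrorLineNumber are properties of the parser that hold the text and number of the most recent line that failed to parse.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class BadLineSkipper
{
    // Returns the rows that parsed cleanly; every malformed line is recorded
    // in badLines (as "lineNumber: text") instead of aborting the whole read.
    public static List<string[]> ReadRows(TextReader input, List<string> badLines)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                try
                {
                    rows.Add(parser.ReadFields());
                }
                catch (MalformedLineException)
                {
                    // Keep the raw line so someone can fix it by hand later.
                    badLines.Add(parser.ErrorLineNumber + ": " + parser.ErrorLine);
                }
            }
        }
        return rows;
    }
}
```

The collected bad lines can then be written to a log or a separate file and sent back to whoever produced the data.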

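ADDENDUM: If your company has a choice (i.e. your data is serialized by a company tool), don't use CSV. Use something like XML or JSON, which have more clearly defined parsing mechanisms.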

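Answer 4: (score: 1)

The only thing I'd add to what everyone else has said (because we've all been there) is to try to correct each new problem as you run into it. There are some decent regex strings at https://www.google.com/?ion=1&espv=2#q=c-sharp+regex+csv+clean or you can fix things manually with String.Replace (String.Replace("\"\"\"", "").Replace("\"\"", "").Replace("\",,", "\",") or whatever). Eventually, as you detect and find ways to correct more and more errors, your manual recovery rate will drop dramatically (most of the bad data probably comes from similar errors). Cheers!

PS - Idea-ish (it's been a while, and the logic may need some tweaking since I'm writing from memory), but you'll get the gist: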

// needs: using System.Text;
public string[] parseCSVWithQuotes(string csvLine, int expectedNumberOfDataPoints)
{
    var ret = new StringBuilder();
    char lastChar = '\0';
    bool needleDown = false; // true while inside single quotes, where characters are treated literally
    foreach (char thisChar in csvLine)
    {
        if (thisChar == '\'' && lastChar != '\'')
            needleDown = !needleDown;
        if (thisChar == ',' && lastChar != ',')
        {
            // inside quotes: convert the literal comma to a pipe so it doesn't cause another break on split
            // outside quotes: keep the comma, the break on split is intended
            ret.Append(needleDown ? '|' : ',');
            lastChar = thisChar;
            continue; // already appended, so don't add it a second time below
        }
        if (!needleDown && (thisChar == '"' || thisChar == '*'))
        {
            // do not add: this eliminates any undesired characters outside single quotes
            // (repeat for any other undesired character or use Regex)
            continue;
        }
        if ((lastChar == '\'' || lastChar == '"' || lastChar == ',') && thisChar == lastChar)
        {
            // do not add: this eliminates doubled characters
            continue;
        }
        // this character is not an undesired character, is not a double, is valid
        ret.Append(thisChar);
        lastChar = thisChar;
    }
    // we've cleaned as best we can
    string[] parts = ret.ToString().Split(',');
    if (parts.Length != expectedNumberOfDataPoints)
    {
        // save ret to bad CSV log
        return null;
    }
    for (int i = 0; i < parts.Length; i++)
    {
        // go back and replace the temporary pipe with the literal comma AFTER split
        parts[i] = parts[i].Replace("|", ",");
    }
    return parts;
}

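Answer 5: (score: 0)

I had to do this once. My approach was to go through a line and keep track of what I was reading. Basically, I wrote my own scanner that cuts tokens off the input line, which gave me full control over my faulty .csv data.

This is what I did: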

For each character on a line of input:
 1. when outside of a string and meeting a comma => all of the previous string (which can be empty) is a valid token.
 2. when outside of a string and meeting anything but a comma or a quote => now you have a real problem, unquoted text => handle as you see fit.
 3. when outside of a string and meeting a quote => found the start of a string.
 4. when inside of a string and meeting a comma => accept the comma as part of the string.
 5. when inside of a string and meeting a quote => trouble starts here, mark this point.
   6. continue, and when meeting a comma (skipping white space if desired) close the string, 'unread' the comma and continue. (That will bring you to point 1.)
   7. or continue, and when meeting a quote => obviously, what was read must be part of the string; add it to the string, 'unread' the quote and continue. (That will bring you to point 5.)
   8. or continue and find whitespace, then End Of Line ('\n') => the last quote must be the closing quote; accept the string as a value.
   9. or continue and find non-whitespace, then End Of Line => now you have a real problem: you have the start of a string but it is not closed => handle the error as you see fit.

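If the number of fields in the .csv file is fixed, you can count the commas you recognize as field separators, and if the count is off when you reach End Of Line, you know you have another problem.

From the stream of strings extracted from the input line you can build a 'clean' .csv line, and in this way build a buffer of accepted and cleaned input that your existing code can use.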
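A compact C# sketch of such a scanner is shown below. It is not the answerer's actual code: LenientCsv, SplitLine and ToCleanLine are hypothetical names, and it simplifies the rules slightly by keeping unquoted text (point 2) as-is and skipping only spaces after a candidate closing quote. A quote inside a string is treated as the closing quote only when the next non-space character is a comma or the end of the line; otherwise it is kept as data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class LenientCsv
{
    // Splits one line into unquoted field values.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inString = false;
        int i = 0;
        while (i < line.Length)
        {
            char c = line[i];
            if (!inString)
            {
                if (c == ',') { fields.Add(current.ToString()); current.Clear(); }
                else if (c == '"' && current.Length == 0) inString = true; // start of string
                else current.Append(c); // unquoted text: kept as-is in this sketch
                i++;
            }
            else if (c == '"')
            {
                // Look ahead past spaces to decide what this quote means.
                int j = i + 1;
                while (j < line.Length && line[j] == ' ') j++;
                if (j == line.Length || line[j] == ',')
                {
                    inString = false; // closing quote
                    i = j;            // leave the comma (if any) for the next pass
                }
                else
                {
                    current.Append(c); // stray quote: part of the data
                    i++;
                }
            }
            else { current.Append(c); i++; }
        }
        if (inString) throw new FormatException("Quoted field is never closed: " + line);
        fields.Add(current.ToString());
        return fields;
    }

    // Rebuilds a clean line: every field quoted, embedded quotes doubled.
    public static string ToCleanLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(f => "\"" + f.Replace("\"", "\"\"") + "\""));
    }
}
```

Comparing the field count returned by SplitLine with the expected number of columns gives the extra sanity check described above. A stray quote that happens to sit directly before a comma inside the data would still be misread as a closing quote, so that case needs its own handling if it occurs.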