Question

I want to detect the delimiter used to separate the columns in CSV or text files.

I am using the TextFieldParser class to read those files.

Below is my code:

String path = @"c:\abc.csv";
DataTable dt = new DataTable();
if (File.Exists(path))
{
    using (Microsoft.VisualBasic.FileIO.TextFieldParser parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(path))
    {
        parser.TextFieldType = FieldType.Delimited;
        if (path.Contains(".txt"))
        {       
            parser.SetDelimiters("|");
        }
        else
        {
            parser.SetDelimiters(",");
        }
        parser.HasFieldsEnclosedInQuotes = true;
        bool firstLine = true;
        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields();
            if (firstLine)
            {
                // The first line holds the column headers.
                foreach (var val in fields)
                {
                    dt.Columns.Add(val);
                }
                firstLine = false;
                continue;
            }
            dt.Rows.Add(fields);
        }
    }
}
lblCount.Text = "Count of total rows in the file: " + dt.Rows.Count.ToString();
dgvTextFieldParser1.DataSource = dt;

Instead of passing the delimiters manually based on the file type, I want to read the delimiter from the file and then pass it.

How can I do that?

Answer 1

Mathematically correct but totally useless answer: It's not possible.

Pragmatic answer: It's possible, but it depends on how much you know about the file's structure. It boils down to a set of assumptions, and the answer will vary depending on which ones we make. And if you can't make any assumptions, well... see the mathematically correct answer.

For instance, can we assume that the delimiter is one of the elements in the set below?

List<char> delimiters = new List<char>{' ', ';', '|'};

Or can we assume that the delimiter is such that it produces elements of equal length?

Should we try to find a delimiter that's a single character or can a word be one?

Etc.

Based on the question, I'll assume that it's the first option and that we have a limited set of possible characters, precisely one of which is the delimiter for a given file.

How about you count the number of occurrences of each such character and assume that the one occurring most frequently is the delimiter? Is that sufficiently rigorous, or do you need to be more sure than that?

// textArray is assumed to hold the file's contents as characters (e.g. a string or char[]).
List<char> delimiters = new List<char> { ' ', ';', '-' };
Dictionary<char, int> counts = delimiters.ToDictionary(key => key, value => 0);
foreach (char c in delimiters)
    counts[c] = textArray.Count(t => t == c);

I'm not in front of a computer so I can't verify, but the last step would be returning the dictionary key whose value is the largest.

You'll need to handle special cases, such as no delimiter being detected at all or two candidate delimiters occurring equally often.
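The counting idea above, together with the final lookup and the two special cases, might be sketched roughly as follows. The names DelimiterGuesser and GuessDelimiter are made up for illustration, and returning null for "nothing found" or "tie" is just one possible convention, not something the approach requires.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DelimiterGuesser
{
    // Hypothetical helper: returns the most frequent candidate in text,
    // or null when no candidate occurs or the top two candidates tie.
    public static char? GuessDelimiter(string text, IEnumerable<char> candidates)
    {
        // Count each candidate once and rank by descending frequency.
        var ranked = candidates
            .Distinct()
            .Select(c => new { Candidate = c, Count = text.Count(t => t == c) })
            .OrderByDescending(x => x.Count)
            .ToList();

        if (ranked.Count == 0 || ranked[0].Count == 0)
            return null; // no candidate appears in the text

        if (ranked.Count > 1 && ranked[1].Count == ranked[0].Count)
            return null; // ambiguous: two candidates are equally frequent

        return ranked[0].Candidate;
    }
}
```

It could be called with something like DelimiterGuesser.GuessDelimiter(File.ReadAllText(path), new[] { ',', ';', '|', '\t' }), after which the caller decides what to do when the result is null (fall back to a default, ask the user, and so on).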

Answer 2

You could probably take the first n bytes of the file, count the possible delimiter characters (or all characters found) using a hash map/dictionary, and take the most frequent character as the likely delimiter. It makes sense to me that the characters used as delimiters would be the ones used the most. When you're done, reset the stream; since you're using a text reader, you would probably have to initialize another reader instead. This gets slightly hairier if the CSV uses more than one delimiter. You would probably also have to ignore some characters, such as letters and digits.
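A minimal sketch of that sampling idea, assuming a sample of the first few thousand characters is representative of the file. DelimiterSampler, MostFrequentSymbol, and the 4096 default are hypothetical choices, and the filter (skip letters, digits, double quotes, and whitespace other than tab) is only one guess at which characters to ignore.

```csharp
using System;
using System.IO;
using System.Linq;

static class DelimiterSampler
{
    // Hypothetical helper: reads up to sampleSize characters and returns the
    // most frequent character that is not a letter, digit, double quote, or
    // whitespace other than tab. Returns null if no such character is found.
    public static char? MostFrequentSymbol(TextReader reader, int sampleSize = 4096)
    {
        char[] buffer = new char[sampleSize];
        int read = reader.ReadBlock(buffer, 0, sampleSize);

        return buffer
            .Take(read)
            .Where(c => c == '\t' ||
                        !(char.IsLetterOrDigit(c) || char.IsWhiteSpace(c) || c == '"'))
            .GroupBy(c => c)
            // On a tie, the character that appeared first in the sample wins,
            // because GroupBy keeps first-seen order and the sort is stable.
            .OrderByDescending(g => g.Count())
            .Select(g => (char?)g.Key)
            .FirstOrDefault();
    }
}
```

Rather than rewinding, the sample can be taken from a short-lived StreamReader that is disposed before a separate TextFieldParser is opened on the same path; the result can then be passed along as parser.SetDelimiters(delimiter.Value.ToString()). Data containing many decimal points or dashes can easily fool this, so restricting the count to a known candidate set may be safer.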

Answer 3

A very simple guessing approach using LINQ:

static class CsvSeperatorDetector
{
    private static readonly char[] SeparatorChars = {';', '|', '\t', ','};

    public static char DetectSeparator(string csvFilePath)
    {
        string[] lines = File.ReadAllLines(csvFilePath);
        return DetectSeparator(lines);
    }

    public static char DetectSeparator(string[] lines)
    {
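        var q = SeparatorChars.Select(sep => new
                {Separator = sep, Found = lines.GroupBy(line => line.Count(ch => ch == sep))})
            // Prefer the separator that appears on the most lines...
            .OrderByDescending(res => res.Found.Where(grp => grp.Key > 0).Sum(grp => grp.Count()))
            // ...and, among those, the one with the fewest distinct per-line counts.
            .ThenBy(res => res.Found.Count())
            .First();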

        return q.Separator;
    }
}

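What this does is read the file line by line (note that CSV files may contain line breaks inside fields) and then check, for each potential separator, how often it occurs in each line. We then check which separator occurs on the most lines, and among those that occur on the same number of lines, we pick the one with the most even distribution (e.g. 5 occurrences on every line ranks higher than one occurrence on one line and 10 on another). Of course, you may have to tweak this for your own purposes, add error handling, fallback logic and so on. I'm sure it's not perfect, but it's good enough for me.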
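One possible tweak, given that the question's parser sets HasFieldsEnclosedInQuotes: a candidate character inside a quoted field is data, not a separator, so it arguably should not be counted. The sketch below is a hypothetical drop-in for the per-line line.Count(ch => ch == sep) expression. QuoteAwareCounter and CountOutsideQuotes are invented names, and it assumes double quotes are the only quoting character and that a quoted field does not span lines.

```csharp
using System;

static class QuoteAwareCounter
{
    // Hypothetical helper: counts occurrences of sep in line, skipping
    // everything that sits between a pair of double quotes.
    public static int CountOutsideQuotes(string line, char sep)
    {
        bool inQuotes = false;
        int count = 0;
        foreach (char ch in line)
        {
            if (ch == '"')
                inQuotes = !inQuotes; // an escaped "" toggles twice, so the state is unchanged
            else if (ch == sep && !inQuotes)
                count++;
        }
        return count;
    }
}
```

With this in place, the grouping key would become QuoteAwareCounter.CountOutsideQuotes(line, sep), leaving the rest of the ranking untouched.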

Find a delimiter of csv or text files in c#
