Question

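I'm trying to match all the newlines in a CSV-like file. The problem is that the huge files always arrive with some broken lines, for example: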

123|some string field|person 123|some optional open comment|324|213
133|some string field|person||324|213
153|some string field|person 123|some comment|324|213
126|some string field|another id|some open and
new line comment|324|213
153|string field|person 123|some comment|324|213
153|string field|person 123|another broken line
comment|324|213
133|field|person||324|213

So, to work around this, I used the following logic:

    string ZSUR = File.ReadAllText(filePath);
    string originalFilePath = filePath;

    // Regular Expression to fix line break issues
    Regex RE = new Regex(@"[\r\t\n]+([^0-9\r\t\n]{3}[^|\r\t\n])");

    ZSUR = RE.Replace(ZSUR, "$1");

    // Backup the original file
    string[] backupFilePath = Regex.Split(filePath, @".txt$");
    File.Delete(backupFilePath[0] + "_BACKUP.txt");
    File.Move(originalFilePath, backupFilePath[0] + "_BACKUP.txt");

    // And then save on the same path the fixed file
    File.WriteAllText(originalFilePath, ZSUR);

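It fixes 90% of the cases, because the first part of a correct line always starts with a three-digit number followed by a pipe.

But I don't know why it fails to match cases like these: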

126|some string field|another id|some open and
double newlined 
123 coment|324|213
153|some string field|person 123|some comment|324|213
153|some string field|person 123|some comment|324|213
153|string field|person 123|Please split this line
31 pcs: 05/03/2013
31|324|213
153|some string field|person 123|some comment|324|213

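As you can see, I need a different approach to solve this. I know that after N pipes I reach that annoying comment field. So, is there some way to match all newlines (and similar characters) that occur after N pipes, counting from the start of a line?

Ideas from others are very welcome too.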

Edit: Thanks for the answers.

I solved the problem with the following regular expression:

(?<!\|[CA]?\|([0-9]{2}.[0-9]{2}.[0-9]{4})?)[\n\r]+

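Of course, my real file differs slightly from the posted example, but the main idea is to match every run of newlines, [\n\r]+, that is not preceded by the

(?<! ... )

expression.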
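To make the negative-lookbehind idea concrete for the sample data posted above (not the real file), here is a minimal sketch. It assumes that every complete record ends with two numeric fields, as in `|324|213`, and that no broken fragment ends that way. The names `LineRepair` and `Fix` are made up for illustration, and joining fragments with a space is a choice of this sketch; the original code simply dropped the newline.

```csharp
using System.Text.RegularExpressions;

public static class LineRepair
{
    // Matches a line break that is NOT preceded by "|digits|digits".
    // The optional \r inside the lookbehind keeps the \n of a "\r\n" pair
    // from being treated as a stray newline on Windows-style files.
    // Assumption: complete records always end with two numeric fields.
    private static readonly Regex StrayNewline =
        new Regex(@"(?<!\|[0-9]+\|[0-9]+\r?)\r?\n");

    // Replaces every stray line break with the given text (a space by default).
    public static string Fix(string text, string replacement = " ")
    {
        return StrayNewline.Replace(text, replacement);
    }
}
```

Calling `LineRepair.Fix(File.ReadAllText(filePath))` would slot into the question's code in place of `RE.Replace(ZSUR, "$1")`. .NET allows variable-length patterns such as `[0-9]+` inside a lookbehind, which many other regex engines do not.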

Answer 1

You can process everything like this, where "Clean" is a method that you define.

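// lines2 holds the entire file content as a single string.
var prev = string.Empty;
const int requiredValueCount = 6;

// Split on both Windows and Unix line endings.
foreach (var line in lines2.Split(new[] {"\r\n", "\n"}, StringSplitOptions.None))
{
    // Glue any pending fragment onto this line and count the fields.
    var values = (prev + line).Split('|');

    if (values.Length == requiredValueCount)
    {
        // Complete record: reset the buffer and hand the fields off.
        prev = string.Empty;
        Clean(values);
    }
    else
    {
        // Still incomplete: keep accumulating.
        prev += line;
    }
}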

Answer 2

First replace every (\|\d+\n) with something odd, like \|\d~~

Then join all the lines, removing \n

Then split on ~~
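The three steps above can be sketched as follows. This is only an illustration: it assumes the marker `~~` never occurs in the data and that a line break preceded by `|digits` always ends a record. The names `MarkerSplit` and `SplitRecords` are hypothetical, and replacing the leftover line breaks with a space rather than nothing is a choice of this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MarkerSplit
{
    // Assumption: this marker never appears in the real data.
    private const string Marker = "~~";

    public static string[] SplitRecords(string text)
    {
        // Step 1: a line break right after "|digits" is a genuine record end;
        // swap it for the marker and keep the captured field.
        string marked = Regex.Replace(text, @"(\|\d+)\r?\n", "$1" + Marker);

        // Step 2: every line break still present sits inside a comment field,
        // so join the fragments with a single space.
        string joined = Regex.Replace(marked, @"[\r\n]+", " ");

        // Step 3: split on the marker to recover one string per record.
        return joined.Split(new[] { Marker }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

A comment fragment that itself ends in a pipe followed by digits would be cut too early by step 1, so this only suits data shaped like the posted sample.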

Answer 3

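I wouldn't reinvent the wheel unnecessarily. Try Sebastien Lorion's Fast CSV Reader. It will probably do what you need (or give you the facilities to take corrective action on errors). I've used this reader and it's quite good.

Another option is KBCsv from Codeplex. I've never used it, but it's probably good.

I'd also take the approach of reading the file as-is into a list of records. Since you don't seem to need more than a little look-ahead/look-behind, you can easily do it in a single pass over the file, like this: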

// MyCsvReader, CheckForSpliceConditions and Splice are placeholders
// for your own reader and splice logic.
public IEnumerable<string[]> ReadRecordsFromCSV()
{
  string[] prev = null ;
  string[] curr = null ;

  // read each individual record from the file
  while ( null != (curr=MyCsvReader.ReadRecord()) )
  {

    if ( prev == null )
    { // no previous record? just shift and continue
      prev = curr ;
      continue ;
    }

    // previous record? splice if needed and emit a record
    string[] record ;
    bool spliceNeeded = CheckForSpliceConditions(prev,curr) ;

    if ( spliceNeeded )
    { // splice needed? build the record to emit and clear the previous record
      record = Splice( prev , curr ) ;
      prev = null ;
    }
    else
    { // no splice needed? set the record to emit and shift
      record = prev ;
      prev = curr ;
    }

    // emit the record
    yield return record ;
  }

  // emit the last record if there is one.
  if ( prev != null )
  {
    yield return prev ;
  }

}

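If you need multiple levels of look-ahead/look-behind, you need something like a shift register, where you add records at the end of the list and remove them from the front. You can use a List<string[]> as the shift register, but it's a bit ugly to do so.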
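As a rough illustration of the shift-register idea, the sketch below uses a `Queue<T>` instead of the `List<string[]>` mentioned above, since a queue already adds at the back and removes from the front. The names `ShiftRegister` and `Windows` are invented for this example.

```csharp
using System.Collections.Generic;

public static class ShiftRegister
{
    // Yields every run of `size` consecutive items from the source.
    // Each yielded array is a snapshot of the register, oldest item first.
    public static IEnumerable<T[]> Windows<T>(IEnumerable<T> source, int size)
    {
        var window = new Queue<T>(size);

        foreach (var item in source)
        {
            // Shift the new item in at the back...
            window.Enqueue(item);

            // ...and shift the oldest one out once the register overflows.
            if (window.Count > size)
            {
                window.Dequeue();
            }

            // Only emit once the register is full.
            if (window.Count == size)
            {
                yield return window.ToArray();
            }
        }
    }
}
```

With records of type `string[]`, `ShiftRegister.Windows(records, 3)` would let the splice check look at one record before and one after the current one.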

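Edited to note: alternatively (and more simply), if a splice is needed, just append the current record to the previous record until no further splice is needed. Once that's the case, the previous record gets emitted and you start over, thus: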

public IEnumerable<string[]> ReadRecordsFromCSV()
{
  string[] prev = null ;
  string[] curr = null ;

  // read each individual record from the file
  while ( null != (curr=MyCsvReader.ReadRecord()) )
  {

    if ( prev == null )
    { // no previous record? just shift and continue
      prev = curr ;
    }
    else
    { // previous record? splice if needed, otherwise emit it
      bool spliceNeeded = CheckForSpliceConditions(prev,curr) ;

      if ( spliceNeeded )
      { // splice needed? fold the current record into the previous one and keep going
        prev = Splice( prev , curr ) ;
      }
      else
      { // no splice needed? emit the previous record and shift
        yield return prev ;
        prev = curr ;
      }

    }

  }

  // emit the last record if there is one.
  if ( prev != null )
  {
    yield return prev ;
  }

}
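For a version that runs against the sample data in the question, here is one possible way to fill in the placeholders. Everything specific in it is an assumption: a complete record is taken to have exactly 6 pipe-separated fields, `CheckForSpliceConditions` and `Splice` are given hypothetical bodies, and the reader is replaced by a plain sequence of lines.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RecordSplicer
{
    // Assumption: a complete record has exactly this many fields.
    public const int FieldCount = 6;

    // Hypothetical splice test: the previous record is still short of fields.
    private static bool CheckForSpliceConditions(string[] prev, string[] curr)
    {
        return prev.Length < FieldCount;
    }

    // Hypothetical splice: the first field of curr continues the last field
    // of prev (joined with a space); the rest of curr is appended as-is.
    private static string[] Splice(string[] prev, string[] curr)
    {
        var merged = new List<string>(prev);
        merged[merged.Count - 1] += " " + curr[0];
        merged.AddRange(curr.Skip(1));
        return merged.ToArray();
    }

    public static IEnumerable<string[]> ReadRecords(IEnumerable<string> lines)
    {
        string[] prev = null;

        foreach (var line in lines)
        {
            var curr = line.Split('|');

            if (prev == null)
            {
                prev = curr;                 // nothing pending yet
            }
            else if (CheckForSpliceConditions(prev, curr))
            {
                prev = Splice(prev, curr);   // keep accumulating
            }
            else
            {
                yield return prev;           // pending record is complete
                prev = curr;
            }
        }

        if (prev != null)
        {
            yield return prev;               // emit the last record
        }
    }
}
```

Feeding it `File.ReadLines(filePath)` would process the file in a single pass. A comment that itself contains a pipe would throw the field count off, so this only fits data shaped like the posted sample.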

How do I match newlines after N occurrences of a specific character?
