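I'm trying to use this regular expression to capture the fields on a tab-delimited line. It seems to work in every case except when the line starts with two tabs: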
^\t|"(?<field>[^"]+|\t(?=\t))"|(?<field>[^\t]+|\t(?=\t))|\t$
For example, where \t represents a tab character:
\t \t 123 \t abc \t 345 \t efg
Only 5 fields are captured; one of the two leading "blank" fields (the ones produced by the leading tabs) is omitted.
Answer 0 (score: 3)
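A regex is probably not the best tool for this job. I suggest you use the TextFieldParser class, which is designed for parsing files with delimited or fixed-width fields. If you're coding in C#, the fact that it lives in the Microsoft.VisualBasic assembly is a bit annoying, but that doesn't prevent you from using it...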
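As a rough illustration of that suggestion, here is a minimal sketch of reading one tab-delimited line with TextFieldParser. The class name `TextFieldParserDemo` and the helper `ParseTabDelimitedLine` are made up for this example, and it assumes your project can reference the Microsoft.VisualBasic assembly. Turning off `TrimWhiteSpace` is my own choice here, so that field values come back exactly as they appear in the input.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserDemo
{
    // Hypothetical helper: parses a single tab-delimited line into its fields.
    public static string[] ParseTabDelimitedLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters("\t");
            // Lets a quoted field contain the delimiter itself.
            parser.HasFieldsEnclosedInQuotes = true;
            // Assumption: keep field text untouched instead of trimming it.
            parser.TrimWhiteSpace = false;

            // ReadFields skips blank lines, so guard against having no data at all.
            return parser.EndOfData ? new string[0] : parser.ReadFields();
        }
    }

    public static void Main()
    {
        // The sample line from the question: two leading tabs, then four values.
        var fields = ParseTabDelimitedLine("\t\t123\tabc\t345\tefg");

        Console.WriteLine("Field count: " + fields.Length);
        foreach (var field in fields)
        {
            Console.WriteLine("[" + field + "]");
        }
    }
}
```

For a real file you would pass the file path (or a StreamReader) to the constructor and loop on `EndOfData`, calling `ReadFields()` once per line.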
Answer 1 (score: 1)
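Agreed that regex is not the right tool for this job.
I was in the middle of cleaning up this code when Thomas posted the link to that nice little gem in the framework. I have used this method to parse delimited text that may contain quoted strings and escape characters. It's probably not the most optimized in the world, but in my opinion it's quite readable, and it gets the job done.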
    // Requires: using System.Collections.Generic; using System.Text;

    /// <summary>
    /// Breaks a string into tokens using a delimiter and specified text qualifier and escape sequence.
    /// </summary>
    /// <param name="line">The string to tokenize.</param>
    /// <param name="delimeter">The delimiter between tokens, such as a comma.</param>
    /// <param name="textQualifier">The text qualifier which enables the delimiter to be embedded in a single token.</param>
    /// <param name="escapeSequence">The escape character which enables the text qualifier to be embedded in a token.</param>
    /// <returns>A collection of string tokens.</returns>
    public static IEnumerable<string> Tokenize(string line, char delimeter, char textQualifier = '\"', char escapeSequence = '\\')
    {
        var inString = false;
        var escapeNext = false;
        var token = new StringBuilder();

        for (int i = 0; i < line.Length; i++)
        {
            // If the last character was an escape character, then it doesn't matter what
            // this character is (field terminator, text qualifier, etc.) because it needs
            // to appear as a part of the field value.
            if (escapeNext)
            {
                escapeNext = false;
                token.Append(line[i]);
                continue;
            }

            // An escape character is dropped; it only protects the character that follows.
            if (line[i] == escapeSequence)
            {
                escapeNext = true;
                continue;
            }

            // A text qualifier toggles quoted mode and is not part of the token.
            if (line[i] == textQualifier)
            {
                inString = !inString;
                continue;
            }

            // Hit the end of the current token?
            if (line[i] == delimeter && !inString)
            {
                yield return token.ToString();

                // Clear the string builder (instead of allocating a new one).
                token.Remove(0, token.Length);
                continue;
            }

            token.Append(line[i]);
        }

        // Whatever remains after the last delimiter is the final token (possibly empty).
        yield return token.ToString();
    }
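To show how the method behaves on the line from the question, here is a small self-contained sketch. The wrapper class `TokenizeDemo` and the `Main` method are only for illustration; the `Tokenize` body repeats the logic above in condensed form so the example compiles on its own (it uses `StringBuilder.Clear()`, which assumes .NET 4 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class TokenizeDemo
{
    // Condensed copy of the Tokenize method above, same behavior.
    public static IEnumerable<string> Tokenize(string line, char delimeter, char textQualifier = '"', char escapeSequence = '\\')
    {
        var inString = false;
        var escapeNext = false;
        var token = new StringBuilder();

        foreach (var c in line)
        {
            if (escapeNext) { escapeNext = false; token.Append(c); continue; }
            if (c == escapeSequence) { escapeNext = true; continue; }
            if (c == textQualifier) { inString = !inString; continue; }

            if (c == delimeter && !inString)
            {
                yield return token.ToString();
                token.Clear();
                continue;
            }

            token.Append(c);
        }

        yield return token.ToString();
    }

    public static void Main()
    {
        // The sample line from the question: two leading tabs, then four values.
        var fields = Tokenize("\t\t123\tabc\t345\tefg", '\t').ToList();
        Console.WriteLine(fields.Count);              // prints 6
        Console.WriteLine(string.Join("|", fields));  // prints ||123|abc|345|efg

        // A quoted field may contain the delimiter.
        var quoted = Tokenize("a,\"b,c\",d", ',').ToList();
        Console.WriteLine(string.Join("|", quoted));  // prints a|b,c|d
    }
}
```

Because every delimiter outside a quoted section ends a token, even an empty one, both leading tabs produce an empty field, which is the case the regex was missing.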