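There are plenty of search results online (and on SO) for things similar to what I need to do, but I haven't found a solution for my particular situation.
I have a comma-delimited file in which only the columns that contain commas are wrapped in double quotes. Fields without commas are simply separated by commas.
For example: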
123,"box,toy",phone,"red,car,cat,dog","bike,pencil",man,africa,yellow,"jump,rope"
The output for that line must be:
123|box,toy|phone|red,car,cat,dog|bike,pencil|man|africa|yellow|jump,rope
I currently have this code:
Using sr As New StreamReader(csvFilePath)
Dim line As String = ""
Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
Dim strReplacerQuoteComma As String = Chr(34) & ","
Dim strReplacerCommaQuote As String = "," & Chr(34)
Do While sr.Peek <> -1
line = sr.ReadLine
line = Replace(line, strReplacerQuoteCommaQuote, "|")
line = Replace(line, strReplacerQuoteComma, "|")
line = Replace(line, strReplacerCommaQuote, "|")
line = Replace(line, Chr(34), "")
Console.WriteLine("line: " & line)
Loop
End Using
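The problem with this process is that by the time the fourth Replace (the one that strips the double quotes) has run, the string looks like this: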
123|box,toy|phone|red,car,cat,dog|bike,pencil|man,africa,yellow|jump,rope
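So man and africa need pipes after them, but obviously I can't just do a replace on every comma.
How can I do this? Is there a RegEx that can handle it?
Update with working code
The link in Avinash's comment had the answer. I imported System.Text.RegularExpressions and used the following: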
Using sr As New StreamReader(csvFilePath)
Dim line As String = ""
Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
Dim strReplacerQuoteComma As String = Chr(34) & ","
Dim strReplacerCommaQuote As String = "," & Chr(34)
Do While sr.Peek <> -1
line = sr.ReadLine
Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
Dim replacement As String = "|"
Dim regEx As New Regex(pattern)
Dim newLine As String = regEx.Replace(line, replacement)
newLine = newLine.Replace(Chr(34), "")
Console.WriteLine("newLine: " & newLine)
Loop
End Using
Answer 0 (score: 3)
This seems to work for your example:
Dim result = Regex.Replace(input, ",(?=([^""]*""[^""]*"")*[^""]*$)", Function(m) m.Value.Replace(",", "|"))
result = result.Replace(Chr(34), "")
See the accepted answer here for an explanation of the regex, and make sure you upvote it while you are there, since I basically just stole his regex.
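To make the idea behind that pattern concrete, here is a minimal sketch in Python (used instead of VB.NET only because it is quick to run as a cross-check). It assumes well-formed input with balanced quotes and no escaped quotes inside fields. The function name `to_pipes` and the non-capturing group are illustrative choices, not part of the original answer. The lookahead accepts a comma only if an even number of double quotes remains between it and the end of the line, which means the comma sits outside any quoted field.

```python
import re

# A comma followed by an even number of double quotes up to end of line.
# Equivalent in spirit to the VB.NET pattern above (written without doubled quotes).
OUTSIDE_QUOTES_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def to_pipes(line: str) -> str:
    """Replace commas outside double quotes with '|', then drop the quotes."""
    return OUTSIDE_QUOTES_COMMA.sub("|", line).replace('"', "")


SAMPLE = '123,"box,toy",phone,"red,car,cat,dog","bike,pencil",man,africa,yellow,"jump,rope"'
print(to_pipes(SAMPLE))
```

Because the lookahead rescans the rest of the line for every comma, the work grows with line length, which is consistent with the timings discussed in the edits that follow.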
Edit: Regarding your performance concerns, I created a file with 90k lines like this:
abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,yellow,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"
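That comes to a file size of roughly 35MB, and my laptop (nothing special) parses it in about 6.5 seconds.
Yes, regex is slow, and the TextFieldParser class is also widely reported as not being the fastest, but if your processing still takes more than 5 minutes, your code clearly has some other bottleneck. Note that I am not actually doing anything with the parsed result.
Edit 2: OK, I thought I'd give this one last go (I'm bored this morning), but I still can't replicate your extended conversion times.
Time to get brutal. I created an input file with 150k lines like this: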
abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"
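Each line has 1140 characters, and the total file size is about 167MB.
Reading, converting and writing back to a new file took 29 seconds with the following code: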
Dim line, result As String
Dim replace As String = ",(?=([^""]*""[^""]*"")*[^""]*$)"
Using sw As New StreamWriter("d:\output.txt")
Using sr As New StreamReader("d:\input.txt")
While Not sr.EndOfStream
line = sr.ReadLine
result = Regex.Replace(line, replace, Function(m) m.Value.Replace(",", "|"))
sw.WriteLine(result.Replace(Chr(34), ""))
End While
End Using
End Using
Edit 3: Using @sln's regex, this code cuts the processing time for the same file to 4 seconds.
Dim line, result As String
Dim pattern As String = ",([^,""]*(?:""[^""]*"")?[^,""]*)(?=,|$)"
Dim replacement As String = "|$1"
Dim rgx As New Regex(pattern)
Using sw As New StreamWriter("d:\output.txt")
Using sr As New StreamReader("d:\input.txt")
While Not sr.EndOfStream
line = sr.ReadLine
result = rgx.Replace(line, replacement)
sw.WriteLine(result.Replace(Chr(34), ""))
End While
End Using
End Using
So there you go, I think you have a winner. As sln states, this is a relative test, so machine speed doesn't matter.
,(?=([^"]*"[^"]*")*[^"]*$) took 29 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$) took 4 seconds
Finally (and just for completeness), the solution proposed by @jawood2005 is perfectly workable:
Dim line As String
Dim fields As String()
Using sw As New StreamWriter("d:\output.txt")
Using tfp As New FileIO.TextFieldParser("d:\input.txt")
tfp.TextFieldType = FileIO.FieldType.Delimited
tfp.Delimiters = New String() {","}
tfp.HasFieldsEnclosedInQuotes = True
While Not tfp.EndOfData
fields = tfp.ReadFields
line = String.Join("|", fields)
sw.WriteLine(line.Replace(Chr(34), ""))
End While
End Using
End Using
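Using the same 150k-line input file as for the regex solutions, this completed in 18 seconds, so it beats mine, but sln wins the prize for the fastest solution to the problem.
Answer 1 (score: 3)
The bulletproof way.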
# Validate even quotes (one time match): ^[^"]*(?:"[^"]*"[^"]*)*$
# Then ->
# ----------------------------------------------
# Find: /,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)/
# Replace: '|$1'
,
( # (1 start)
[^,"]*
(?: " [^"]* " )?
[^,"]*
) # (1 end)
(?= , | $ )
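The two steps above (validate once, then replace) can be sketched in Python as a quick way to try the patterns. This is an illustrative sketch, not the author's code: the names `has_even_quotes` and `convert` are made up here, and the input is assumed to contain no escaped quotes inside fields. As in the Perl benchmark further down, the quotes themselves are left in place; stripping them would be a separate step.

```python
import re

# One-time validation: the line contains an even number of double quotes.
EVEN_QUOTES = re.compile(r'^[^"]*(?:"[^"]*"[^"]*)*$')

# Consume one whole field after each separating comma, so no lookahead
# ever has to scan to the end of the line.
FIELD_AFTER_COMMA = re.compile(r',([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)')


def has_even_quotes(line: str) -> bool:
    """Return True when the quotes on the line are balanced."""
    return EVEN_QUOTES.match(line) is not None


def convert(line: str) -> str:
    """Swap field-separating commas for '|', keeping the field text as is."""
    if not has_even_quotes(line):
        raise ValueError("unbalanced double quotes: " + line)
    return FIELD_AFTER_COMMA.sub(r"|\1", line)


SAMPLE = '123,"box,toy",phone,"red,car,cat,dog","bike,pencil",man,africa,yellow,"jump,rope"'
print(convert(SAMPLE))
```

Each match consumes the comma together with the field that follows it, so the next search resumes after that field instead of inside it.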
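Benchmark
Since @TheBlueDog posted a benchmark ('Edit 2'), I thought I'd post one as well.
It is based on his input, and its intent is to show the evils of using a 'to end of string' lookahead as a validation technique (i.e. this -> ^[^"]*(?:"[^"]*"[^"]*)*$ ).
Blue Dog's regex replace method was hampered somewhat by an unnecessary callback, so I imagine that accounts for some of his bad numbers.
I don't know VB.NET, so this was done in Perl. Machine speed and language are both factored out because it is a relative test.
Summary: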
,(?=([^"]*"[^"]*")*[^"]*$) took 10 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$) took 2 seconds
That is a 5x difference.
Perl benchmark, 150K lines (167MB file):
use strict;
use warnings;
use Benchmark ':hireswallclock';
my ($t0,$t1);
my ($infile, $outfile);
my $tstr = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"
';
# =================================================
print "\nMaking 150K line (167MB file), csv_data_in.txt ...";
open( $infile, ">", 'csv_data_in.txt' ) or die "can't open 'csv_data_in.txt' for writing $!";
for (1 .. 150_000)
{
print $infile $tstr;
}
close( $infile );
print "\nDone !\n\n";
# =================================================
print "Converting delimiters, writing to csv_data_out.txt ...";
open( $infile, "<", 'csv_data_in.txt' ) or die "can't open 'csv_data_in.txt' for reading $!";
open( $outfile, ">", 'csv_data_out.txt' ) or die "can't open 'csv_data_out.txt' for writing $!";
my $line = '';
$t0 = new Benchmark;
while( $line = <$infile> )
{
# Validation - Uncomment to check line for even quotes, otherwise don't
# if ( $line =~ /^[^"]*(?:"[^"]*"[^"]*)*$/ )
# {
$line =~ s/,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)/|$1/g;
# }
print $outfile $line;
}
$t1 = new Benchmark;
close( $infile );
close( $outfile );
print "\nDone !\n";
print "Conversion took: ", timestr(timediff($t1, $t0)), "\n\n";
Output:
Making 150K line (167MB file), csv_data_in.txt ...
Done !
Converting delimiters, writing to csv_data_out.txt ...
Done !
Conversion took: 2.1216 wallclock secs ( 1.87 usr + 0.17 sys = 2.04 CPU)
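Answer 2 (score: 1)
This may not be the best solution, but it should work...
I'm 99% sure you are using a StreamReader ("sr") to read the file. Try reading it with FileIO.TextFieldParser instead, which lets you split each line into an array of strings.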
Dim aFile As FileIO.TextFieldParser = New FileIO.TextFieldParser(filePath)
Dim temp() As String ' this array will hold each line of data
Dim order As doOrder = Nothing
Dim orderID As Integer
Dim myDate As DateTime = Now.ToString
aFile.TextFieldType = FileIO.FieldType.Delimited
aFile.Delimiters = New String() {","}
aFile.HasFieldsEnclosedInQuotes = True
temp = aFile.ReadFields
' parse the actual file
Do While Not aFile.EndOfData...
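Inside the loop, keep calling "aFile.ReadFields" to read the next line. Once you have the String array, you can join the fields with a pipe between them. It's a bit messy, and it isn't a regex (not sure whether that was an actual requirement or just an idea), but it will get the job done.
Also, note "aFile.HasFieldsEnclosedInQuotes = True", since that is one of the conditions you listed.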
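The read-fields-then-join idea can be sketched in Python for quick experimentation. Python's standard csv module stands in for TextFieldParser here, which is a substitution and not the VB.NET API. The function name `to_pipes` is hypothetical. The reader's default dialect already treats double quotes as field enclosures, which plays the role of HasFieldsEnclosedInQuotes.

```python
import csv


def to_pipes(line: str) -> str:
    """Parse one comma-delimited line and re-join its fields with '|'."""
    # csv.reader accepts any iterable of lines; the default dialect uses ','
    # as the delimiter and '"' as the quote character, so enclosing quotes
    # are removed from the parsed fields.
    fields = next(csv.reader([line]))
    return "|".join(fields)


SAMPLE = '123,"box,toy",phone,"red,car,cat,dog","bike,pencil",man,africa,yellow,"jump,rope"'
print(to_pipes(SAMPLE))
```

With a parser doing the splitting, no separate quote-stripping step is needed, because the enclosing quotes never reach the joined output.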
Edit: I see The Blue Dog posted a regex answer while I was still typing... You may still want to use TextFieldParser, since you are reading a delimited file. I'll go away now.