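After extensive measurement, I have identified a hotspot in one of our Windows services that I would like to optimize. We are processing strings that may contain multiple consecutive spaces, and we want to reduce each run to a single space. We use a static compiled regex for this task:
    private static readonly Regex
        regex_select_all_multiple_whitespace_chars =
            new Regex(@"\s+", RegexOptions.Compiled);
and then use it as follows:
    var cleanString =
        regex_select_all_multiple_whitespace_chars.Replace(dirtyString.Trim(), " ");
This line is invoked several million times and is proving to be fairly intensive. I have tried to write something better, but I am stumped. Given the fairly modest processing requirements of the regex, surely there is something faster. Could unsafe processing with pointers speed things up further?
Edit:
Thanks for the amazing set of responses to this question... most unexpected!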
Answer 0 (score: 8)
This is about three times faster:
    private static string RemoveDuplicateSpaces(string text) {
        StringBuilder b = new StringBuilder(text.Length);
        bool space = false;
        foreach (char c in text) {
            if (c == ' ') {
                if (!space) b.Append(c);
                space = true;
            } else {
                b.Append(c);
                space = false;
            }
        }
        return b.ToString();
    }
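Note that the method above only looks at the ' ' character and does not trim, whereas the question's code matches \s and calls Trim() first. If tabs and line breaks also need to be collapsed, a variant along the following lines might work. This is only a sketch: the names WhitespaceSketch and CollapseWhitespace are made up for illustration, and char.IsWhiteSpace is assumed to be close enough to the set matched by \s for the data at hand.

```csharp
using System.Text;

public static class WhitespaceSketch
{
    // Hypothetical helper: collapses every run of whitespace (as defined by
    // char.IsWhiteSpace) into a single ' ' and drops leading and trailing whitespace.
    public static string CollapseWhitespace(string text)
    {
        var b = new StringBuilder(text.Length);
        bool pendingSpace = false;
        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c))
            {
                // Remember the gap, but only if something has already been
                // written, so leading whitespace is never emitted.
                pendingSpace = b.Length > 0;
            }
            else
            {
                // Emit the remembered gap as one space before the next character.
                if (pendingSpace) b.Append(' ');
                b.Append(c);
                pendingSpace = false;
            }
        }
        // A gap that is still pending here was trailing whitespace and is dropped.
        return b.ToString();
    }
}
```

Because the space is written lazily, just before the next non-whitespace character, no separate trimming pass is needed.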
Answer 1 (score: 7)
How about this...
    public string RemoveMultiSpace(string test)
    {
        var words = test.Split(new char[] { ' ' },
            StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
Test case run with NUnit:
Test times are in milliseconds (the comma is the decimal separator).
Regex Test time: 338,8885
RemoveMultiSpace Test time: 78,9335
    private static readonly Regex regex_select_all_multiple_whitespace_chars =
        new Regex(@"\s+", RegexOptions.Compiled);

    [Test]
    public void Test()
    {
        string startString = "A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F ";
        string cleanString;
        Trace.WriteLine("Regex Test start");
        int count = 10000;
        Stopwatch timer = new Stopwatch();
        timer.Start();
        for (int i = 0; i < count; i++)
        {
            cleanString = regex_select_all_multiple_whitespace_chars.Replace(startString, " ");
        }
        var elapsed = timer.Elapsed;
        Trace.WriteLine("Regex Test end");
        Trace.WriteLine("Regex Test time: " + elapsed.TotalMilliseconds);
        Trace.WriteLine("RemoveMultiSpace Test start");
        timer = new Stopwatch();
        timer.Start();
        for (int i = 0; i < count; i++)
        {
            cleanString = RemoveMultiSpace(startString);
        }
        elapsed = timer.Elapsed;
        Trace.WriteLine("RemoveMultiSpace Test end");
        Trace.WriteLine("RemoveMultiSpace Test time: " + elapsed.TotalMilliseconds);
    }

    public string RemoveMultiSpace(string test)
    {
        var words = test.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
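Edit:
I did some more testing and added Guffa's StringBuilder-based method "RemoveDuplicateSpaces".
My conclusion is that the StringBuilder method is faster when there are many spaces, but with fewer spaces the string split method is slightly faster.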
Cleaning file with about 30000 lines, 10 iterations
RegEx time elapsed: 608,0623
RemoveMultiSpace time elapsed: 239,2049
RemoveDuplicateSpaces time elapsed: 307,2044
Cleaning string, 10000 iterations:
A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F
RegEx time elapsed: 590,3626
RemoveMultiSpace time elapsed: 159,4547
RemoveDuplicateSpaces time elapsed: 137,6816
Cleaning string, 10000 iterations:
A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F A B C D E F
RegEx time elapsed: 290,5666
RemoveMultiSpace time elapsed: 64,6776
RemoveDuplicateSpaces time elapsed: 52,4732
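Timings like these can be skewed by one-off costs such as JIT compilation on the first call. One way to reduce that effect is to make an untimed warm-up call before starting the stopwatch. The helper below is a hypothetical sketch (TimingSketch and Measure are not part of the test code above) and is not a substitute for a proper benchmarking tool.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Hypothetical helper: calls `action` once untimed, then returns the
    // elapsed milliseconds for `iterations` timed calls on the same input.
    public static double Measure(Func<string, string> action, string input, int iterations)
    {
        action(input); // warm-up call, excluded from the measurement

        var timer = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action(input);
        }
        timer.Stop();
        return timer.Elapsed.TotalMilliseconds;
    }
}
```

It could be called as TimingSketch.Measure(RemoveMultiSpace, startString, 10000), so that every candidate is measured the same way.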
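Answer 2 (score: 6)
Currently, you are also replacing each single space with another space. Try matching \s{2,} instead (or something similar, if you want to replace single newlines and other characters).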
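To make the difference concrete, here is a small sketch with both patterns side by side. The class and method names are made up for illustration. Note that the two are not equivalent: \s{2,} leaves a lone tab or newline in place, while \s+ turns it into a space.

```csharp
using System.Text.RegularExpressions;

public static class RegexVariants
{
    // Pattern from the question: matches every whitespace run, including runs of length one.
    private static readonly Regex AnyRun = new Regex(@"\s+", RegexOptions.Compiled);

    // Suggested pattern: matches only runs of two or more whitespace characters.
    private static readonly Regex LongRun = new Regex(@"\s{2,}", RegexOptions.Compiled);

    public static string CollapseAnyRun(string s) => AnyRun.Replace(s, " ");

    public static string CollapseLongRun(string s) => LongRun.Replace(s, " ");
}
```

For input containing only spaces, both methods return the same string; the second simply skips the replacements that would change nothing.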
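Answer 3 (score: 3)
Just one suggestion: if your data contains no Unicode whitespace, then instead of \s+ use [ \r\n]+ or [ \n]+ or just " +" (if there are only spaces). Basically, limit the pattern to the minimal character set.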
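As a sketch of what that looks like in code (the class and member names are hypothetical), the patterns can be declared in the same way as the original regex:

```csharp
using System.Text.RegularExpressions;

public static class NarrowRegexSketch
{
    // Matches runs of spaces, carriage returns and line feeds only.
    private static readonly Regex SpacesAndNewlines =
        new Regex(@"[ \r\n]+", RegexOptions.Compiled);

    // Matches runs of the plain space character only.
    private static readonly Regex SpacesOnly =
        new Regex(" +", RegexOptions.Compiled);

    public static string CollapseSpacesAndNewlines(string s) => SpacesAndNewlines.Replace(s, " ");

    public static string CollapseSpaces(string s) => SpacesOnly.Replace(s, " ");
}
```

Characters outside the chosen set, such as tabs, pass through unchanged, so this is only appropriate when the data is known not to contain them. Whether the narrower class is measurably faster would need to be checked on real input.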
Answer 4 (score: 3)
You could do this without a regular expression. For example:
    private static string NormalizeWhitespace(string test)
    {
        string trimmed = test.Trim();
        var sb = new StringBuilder(trimmed.Length);
        int i = 0;
        while (i < trimmed.Length)
        {
            if (trimmed[i] == ' ')
            {
                // Keep the first space of the run, then skip the rest.
                sb.Append(trimmed[i]);
                do { i++; } while (i < trimmed.Length && trimmed[i] == ' ');
            }
            // Safe only because Trim() guarantees the string does not end with a space.
            sb.Append(trimmed[i]);
            i++;
        }
        return sb.ToString();
    }
With this method and the following test bench:
    private static readonly Regex MultipleWhitespaceRegex = new Regex(
        @"\s+",
        RegexOptions.Compiled);

    static void Main(string[] args)
    {
        string test = "regex select all multiple whitespace chars";
        const int Iterations = 15000;
        var sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < Iterations; i++)
        {
            NormalizeWhitespace(test);
        }
        sw.Stop();
        Console.WriteLine("{0}ms", sw.ElapsedMilliseconds);
        sw.Reset();
        sw.Start();
        for (int i = 0; i < Iterations; i++)
        {
            MultipleWhitespaceRegex.Replace(test, " ");
        }
        sw.Stop();
        Console.WriteLine("{0}ms", sw.ElapsedMilliseconds);
    }
I got the following results:
// NormalizeWhitespace - 27ms
// Regex - 132ms
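Note that this was only tested with one very simple example, that it could be optimized further by removing the call to String.Trim, and that it is only meant to show that regular expressions are sometimes not the best answer.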
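Simply deleting the Trim() call from NormalizeWhitespace would not be enough: for input that ends in a space, the index would run past the end of the string. One possible way to fold the trimming into the loop is sketched below. NormalizeSketch and NormalizeSpaces are hypothetical names, and unlike Trim() this version only treats the ' ' character as trimmable.

```csharp
using System.Text;

public static class NormalizeSketch
{
    // Hypothetical single-pass variant: collapses runs of ' ' and removes
    // leading and trailing spaces without allocating a trimmed copy first.
    public static string NormalizeSpaces(string text)
    {
        var sb = new StringBuilder(text.Length);
        // Start as if a space had just been seen, so leading spaces are skipped.
        bool lastWasSpace = true;
        foreach (char c in text)
        {
            if (c == ' ')
            {
                if (!lastWasSpace) sb.Append(' ');
                lastWasSpace = true;
            }
            else
            {
                sb.Append(c);
                lastWasSpace = false;
            }
        }
        // At most one trailing space can have been written; drop it.
        if (lastWasSpace && sb.Length > 0) sb.Length--;
        return sb.ToString();
    }
}
```

Whether this is actually faster than calling Trim() first would have to be measured.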
Answer 5 (score: 3)
I was curious how a straightforward implementation might perform:
    static string RemoveConsecutiveSpaces(string input)
    {
        bool whiteSpaceWritten = false;
        StringBuilder sbOutput = new StringBuilder(input.Length);
        foreach (Char c in input)
        {
            if (c == ' ')
            {
                if (!whiteSpaceWritten)
                {
                    whiteSpaceWritten = true;
                    sbOutput.Append(c);
                }
            }
            else
            {
                whiteSpaceWritten = false;
                sbOutput.Append(c);
            }
        }
        return sbOutput.ToString();
    }
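Answer 6 (score: 0)
Since this is such a simple expression, replacing two or more spaces with a single space, get rid of the Regex object and hard-code the replacement yourself (in C++/CLI):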
    String ^text = "Some text to process";
    bool spaces = false;
    // make the following static and just clear it rather than reallocating it every time
    System::Text::StringBuilder ^output = gcnew System::Text::StringBuilder;
    for (int i = 0, l = text->Length ; i < l ; ++i)
    {
        if (spaces)
        {
            if (text [i] != ' ')
            {
                output->Append (text [i]);
                spaces = false;
            }
        }
        else
        {
            output->Append (text [i]);
            if (text [i] == ' ')
            {
                spaces = true;
            }
        }
    }
    text = output->ToString ();
Answer 7 (score: 0)
An array will always be faster:
    public static string RemoveMultiSpace(string input)
    {
        var value = input;
        if (!string.IsNullOrEmpty(input))
        {
            var isSpace = false;
            var index = 0;
            var length = input.Length;
            var tempArray = new char[length];
            for (int i = 0; i < length; i++)
            {
                var symbol = input[i];
                if (symbol == ' ')
                {
                    if (!isSpace)
                    {
                        tempArray[index++] = symbol;
                    }
                    isSpace = true;
                }
                else
                {
                    tempArray[index++] = symbol;
                    isSpace = false;
                }
            }
            value = new string(tempArray, 0, index);
        }
        return value;
    }