Question

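I'm experimenting with the parallel programming features of Delphi XE7 Update 1.

I created a simple TParallel.For loop that basically does some bogus operations to pass the time.

I launched the program on an AWS instance with 36 vCPUs (c4.8xlarge) to see what the gain from parallel programming could be.

When I first launch the program and execute the TParallel.For loop, I see a significant gain (although admittedly a lot less than I expected with 36 vCPUs):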

Parallel matches: 23077072 in 242ms
Single Threaded matches: 23077072 in 2314ms

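If I do not close the program and run the pass again on the 36 vCPU machine shortly afterwards (for example, immediately or some 10-20 seconds later), the parallel pass gets much worse: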

Parallel matches: 23077169 in 2322ms
Single Threaded matches: 23077169 in 2316ms

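If I do not close the program and wait a few minutes (not a few seconds, but a few minutes) before running the pass again, I once more get the results I got when I first launched the program (a 10x improvement in response time).

The very first pass after launching the program is always faster on the 36 vCPU instance, so it seems that this effect only happens from the second time TParallel.For is called in the program.

This is the sample code I'm running: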

unit ParallelTests;

interface

uses
  Winapi.Windows, Winapi.Messages, System.SysUtils, System.Variants, System.Classes, Vcl.Graphics,
  System.Threading, System.SyncObjs, System.Diagnostics,
  Vcl.Controls, Vcl.Forms, Vcl.Dialogs, Vcl.StdCtrls;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    SingleThreadCheckBox: TCheckBox;
    ParallelCheckBox: TCheckBox;
    UnitsEdit: TEdit;
    Label1: TLabel;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.Button1Click(Sender: TObject);
var
  matches: integer;
  i,j: integer;
  sw: TStopWatch;
  maxItems: integer;
  referenceStr: string;

begin
  sw := TStopWatch.Create;

  maxItems := 5000;

  Randomize;
  SetLength(referenceStr, 120000);
  for i := 1 to 120000 do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  if ParallelCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    TParallel.For(1, MaxItems,
      procedure (Value: Integer)
        var
          index: integer;
          found: integer;
        begin
          found := 0;
          for index := 1 to length(referenceStr) do begin
            if (((Value mod 26) + ord('a')) = ord(referenceStr[index])) then begin
              inc(found);
            end;
          end;
          TInterlocked.Add(matches, found);
        end);
    sw.Stop;
    Memo1.Lines.Add('Parallel matches: ' + IntToStr(matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;

  if SingleThreadCheckBox.Checked then begin
    matches := 0;
    sw.Reset;
    sw.Start;
    for i := 1 to MaxItems do begin
      for j := 1 to length(referenceStr) do begin
        if (((i mod 26) + ord('a')) = ord(referenceStr[j])) then begin
          inc(matches);
        end;
      end;
    end;
    sw.Stop;
    Memo1.Lines.Add('Single Threaded matches: ' + IntToStr(Matches) + ' in ' + IntToStr(sw.ElapsedMilliseconds) + 'ms');
  end;
end;

end.

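Is this working as designed? I found this article (http://delphiaball.co.uk/tag/parallel-programming/) recommending that I let the library decide the thread pool, but I don't see the point of using parallel programming to serve requests faster if I have to wait minutes between one request and the next.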

Am I missing something about how a TParallel.For loop is supposed to be used?
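For context, here is a minimal sketch of the alternative that the article advises against: passing an explicitly created TThreadPool to TParallel.For instead of relying on the default pool. The helper name SumOnPrivatePool and the worker limit of 64 are illustrative assumptions, and this has not been measured on the 36 vCPU instance, so it is not known whether it changes the behaviour described above.

```delphi
program PrivatePoolSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Threading;

// Hypothetical helper: sums 1..Count using a privately owned thread pool
// rather than the default pool that TParallel.For uses otherwise.
function SumOnPrivatePool(Count: Integer): Integer;
var
  pool: TThreadPool;
  total: Integer;
begin
  total := 0;
  pool := TThreadPool.Create;
  try
    // Illustrative limit only (an assumption, not a tuned value).
    // The call returns False if the pool rejects the requested limit.
    if not pool.SetMaxWorkerThreads(64) then
      Writeln('Max worker thread limit was not accepted');

    // Same call shape as in the test code, with the pool as last argument.
    TParallel.For(1, Count,
      procedure(Value: Integer)
      begin
        AtomicIncrement(total, Value);
      end,
      pool);
  finally
    pool.Free;
  end;
  Result := total;
end;

begin
  Writeln('Sum: ', SumOnPrivatePool(1000));
end.
```

The pool is created and freed around a single call here only to keep the sketch short; a long-running program would more plausibly create it once and reuse it.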

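Please note that I cannot reproduce this on an AWS m3.large instance (2 vCPUs according to AWS). In that case I always see a slight improvement, and I do not get worse results on consecutive TParallel.For calls made shortly afterwards.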

Parallel matches: 23077054 in 2057ms
Single Threaded matches: 23077054 in 2900ms

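So it seems that this effect occurs when many cores are available (36), which is a pity, because the whole point of parallel programming is to benefit from many cores. I wonder whether this is a library bug triggered by the high core count, or by the fact that the core count is not a power of 2 in this case.

UPDATE: After testing on various AWS instances with different vCPU counts, this seems to be the behaviour:

36 vCPUs (c4.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel.For (which makes it unusable for production)

32 vCPUs (c3.8xlarge). You have to wait minutes between subsequent calls to a vanilla TParallel.For (which makes it unusable for production)

16 vCPUs (c3.4xlarge). You have to wait for under a second. It could be usable if load is low but response time still matters

8 vCPUs (c3.2xlarge). It seems to work normally

4 vCPUs (c3.xlarge). It seems to work normally

2 vCPUs (m3.large). It seems to work normally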

Answer 1

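I created two test programs, based on yours, to compare System.Threading and OTL. I built with XE7 Update 1 and OTL r1397. The OTL source that I used corresponds to release 3.04. I built with the 32-bit Windows compiler, using release build options.

My test machine is a dual Intel Xeon E5530 running Windows 7 x64. The system has two quad-core processors. That is 8 processors in total, but the system reports 16 because of hyper-threading. Experience tells me that hyper-threading is just marketing, and I have never seen scaling beyond a factor of 8 on this machine.

Now for the two programs, which are almost identical.

System.Threading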

program SystemThreadingTest;

{$APPTYPE CONSOLE}

uses
  System.Diagnostics,
  System.Threading;

const
  maxItems = 5000;
  DataSize = 100000;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  TParallel.For(1, maxItems,
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

OTL

program OTLTest;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  Winapi.Messages,
  System.Diagnostics,
  OtlParallel;

const
  maxItems = 5000;
  DataSize = 100000;

procedure ProcessThreadMessages;
var
  msg: TMsg;
begin
  while PeekMessage(Msg, 0, 0, 0, PM_REMOVE) and (Msg.Message <> WM_QUIT) do begin
    TranslateMessage(Msg);
    DispatchMessage(Msg);
  end;
end;

procedure DoTest;
var
  matches: integer;
  i, j: integer;
  sw: TStopWatch;
  referenceStr: string;
begin
  Randomize;
  SetLength(referenceStr, DataSize);
  for i := low(referenceStr) to high(referenceStr) do
    referenceStr[i] := Chr(Ord('a') + Random(26));

  // parallel
  matches := 0;
  sw := TStopWatch.StartNew;
  Parallel.For(1, maxItems).Execute(
    procedure(Value: integer)
    var
      index: integer;
      found: integer;
    begin
      found := 0;
      for index := low(referenceStr) to high(referenceStr) do
        if (((Value mod 26) + Ord('a')) = Ord(referenceStr[index])) then
          inc(found);
      AtomicIncrement(matches, found);
    end);
  Writeln('Parallel matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');

  ProcessThreadMessages;

  // serial
  matches := 0;
  sw := TStopWatch.StartNew;
  for i := 1 to maxItems do
    for j := low(referenceStr) to high(referenceStr) do
      if (((i mod 26) + Ord('a')) = Ord(referenceStr[j])) then
        inc(matches);
  Writeln('Serial matches: ', matches, ' in ', sw.ElapsedMilliseconds, 'ms');
end;

begin
  while True do
    DoTest;
end.

Now for the output.

System.Threading output

Parallel matches: 19230817 in 374ms
Serial matches: 19230817 in 2423ms
Parallel matches: 19230698 in 374ms
Serial matches: 19230698 in 2409ms
Parallel matches: 19230556 in 368ms
Serial matches: 19230556 in 2433ms
Parallel matches: 19230635 in 2412ms
Serial matches: 19230635 in 2430ms
Parallel matches: 19230843 in 2441ms
Serial matches: 19230843 in 2413ms
Parallel matches: 19230905 in 2493ms
Serial matches: 19230905 in 2423ms
Parallel matches: 19231032 in 2430ms
Serial matches: 19231032 in 2443ms
Parallel matches: 19230669 in 2440ms
Serial matches: 19230669 in 2473ms
Parallel matches: 19230811 in 2404ms
Serial matches: 19230811 in 2432ms
....

OTL output

Parallel matches: 19230667 in 422ms
Serial matches: 19230667 in 2475ms
Parallel matches: 19230663 in 335ms
Serial matches: 19230663 in 2438ms
Parallel matches: 19230889 in 395ms
Serial matches: 19230889 in 2461ms
Parallel matches: 19230874 in 391ms
Serial matches: 19230874 in 2441ms
Parallel matches: 19230617 in 385ms
Serial matches: 19230617 in 2524ms
Parallel matches: 19231021 in 368ms
Serial matches: 19231021 in 2455ms
Parallel matches: 19230904 in 357ms
Serial matches: 19230904 in 2537ms
Parallel matches: 19230568 in 373ms
Serial matches: 19230568 in 2456ms
Parallel matches: 19230758 in 333ms
Serial matches: 19230758 in 2710ms
Parallel matches: 19230580 in 371ms
Serial matches: 19230580 in 2532ms
Parallel matches: 19230534 in 336ms
Serial matches: 19230534 in 2436ms
Parallel matches: 19230879 in 368ms
Serial matches: 19230879 in 2419ms
Parallel matches: 19230651 in 409ms
Serial matches: 19230651 in 2598ms
Parallel matches: 19230461 in 357ms
....

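I left the OTL version running for a long time and the pattern never changed. The parallel version was always around 7 times faster than the serial one.

Conclusion

The code is very simple. The only reasonable conclusion that can be drawn is that the implementation of System.Threading is defective.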
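One way to probe this further is to count how many distinct OS threads actually run iterations in each pass. The following is a sketch only: CountWorkerThreads is a hypothetical helper, it was not part of the tests above, and its loop body is far lighter than the real workload, so the numbers it reports are indicative at best.

```delphi
program WorkerCountSketch;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows,
  System.SysUtils,
  System.SyncObjs,
  System.Generics.Collections,
  System.Threading;

// Hypothetical helper: runs one TParallel.For pass and returns how many
// distinct OS threads executed at least one iteration.
function CountWorkerThreads(Iterations: Integer): Integer;
var
  seen: TDictionary<Cardinal, Boolean>;
  lock: TCriticalSection;
begin
  seen := TDictionary<Cardinal, Boolean>.Create;
  lock := TCriticalSection.Create;
  try
    TParallel.For(1, Iterations,
      procedure(Value: Integer)
      var
        id: Cardinal;
      begin
        id := GetCurrentThreadId;
        // TDictionary is not thread-safe, so guard every access.
        lock.Acquire;
        try
          seen.AddOrSetValue(id, True);
        finally
          lock.Release;
        end;
      end);
    Result := seen.Count;
  finally
    lock.Free;
    seen.Free;
  end;
end;

var
  pass: Integer;
begin
  // Repeat the pass, as the test programs do, and report each count.
  for pass := 1 to 10 do
    Writeln('Pass ', pass, ': ', CountWorkerThreads(5000), ' worker thread(s)');
end.
```

If the count collapsed to one or two threads on later passes, that would be consistent with the parallel timings falling back to serial speed, though it would not by itself identify the cause.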

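There have been numerous bug reports relating to the new System.Threading library. All the signs are that its quality is poor. Embarcadero has a long track record of releasing sub-standard library code. I'm thinking of TMonitor, the XE3 string helper, earlier versions of System.IOUtils, and FireMonkey. The list goes on.

It seems clear that quality is a big problem at Embarcadero. Code is released that quite evidently has not been tested adequately, if at all. This is especially troublesome for a threading library, where bugs can lie dormant and only be exposed on specific hardware/software configurations. The experience with TMonitor leads me to believe that Embarcadero does not have sufficient expertise to produce high-quality, correct threading code.

My advice is that you should not use System.Threading in its current form. Until it can be seen to have sufficient quality and correctness, it should be avoided. I suggest that you use OTL.

EDIT: The original OTL version of the program had a live memory leak, caused by an awkward implementation detail. Parallel.For creates tasks with the .Unobserved modifier. As a result, those tasks are only destroyed when an internal message window receives a "task has terminated" message. That window is created in the same thread as the Parallel.For caller, which in this case is the main thread. Because the main thread was not processing messages, the tasks were never destroyed and memory consumption (along with other resources) just piled up. It is possible that this caused the program to hang after some time.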

Strange behaviour of TParallel.For with the default ThreadPool