为什么多线程内存分配/解除分配密集型应用程序不会随线程数量而扩展?

时间:2014-11-13 15:47:11

标签: json multithreading delphi dwscript

注意:

原帖子标题

  

为什么DWScript中的多线程JSON解析器无法按线程数进行扩展?

已更改,因为此问题与使用DWScript 处理JSON数据无关。 问题出在Delphi XE2到XE7的默认内存管理器中(测试的是XE2和试用版XE7),但问题首先出现在这种类型的应用程序中。


我有多线程的Win32 / Win64 vcl应用程序,它在Delphi XE2中处理JSON数据。

每个线程使用DWScript中的TdwsJSONValue.ParseString(sJSON)解析JSON数据,使用DWScript方法读取值并将结果存储为记录。

出于测试目的,我在每个线程中处理相同的JSON数据。

单个thead运行在线程内花费N秒来处理数据。越来越多的线程到M lineary(大约M * N)会增加处理相同数据所需的单线程内的时间。

结果没有速度提升。此应用程序的其他部分(JSON数据传递,将结果存储在目标环境中) - 按预期进行扩展。

可能是什么原因?任何想法都赞赏。

补充资料:

  1. 在Win7 / 32和Win7 / 64,Win8 / 64上测试,从2核到12核(w / w-out HT)系统

  2. 选择最快的DWScript(测试了一堆,其中包括:Superobject,内置Delphi)。 SO的行为类似于DWS中的JSON单元。

  3. 以下是完整的控制台应用程序,说明了问题。要运行它,我们需要这里提供的示例json数据:https://www.dropbox.com/s/4iuv87ytpcdugk6/json1.zip?dl=0此文件包含第一个线程的数据json1.dat。对于最多16个线程,只需将json1.dat复制到json2.dat ... json16.dat即可。

    程序和数据应该在同一个文件夹中。运行:convert.exe N,其中N是线程数。

    程序将执行时间在msecs中写入stout - 在线程中花费,解析数据的时间和释放时间(Destroy)TdwsJSONValue对象。 声明_dwsjvData.Destroy;无法扩展。


  4. program Convert;
    
    {$APPTYPE CONSOLE}
    
    {$R *.res}
    
    uses
      System.SysUtils,
      System.Diagnostics,
      System.Classes,
      dwsJSON in 'dwsJSON.pas',
      dwsStrings in 'dwsStrings.pas',
      dwsUtils in 'dwsUtils.pas',
      dwsXPlatform in 'dwsXPlatform.pas';
    
    type
    
      TWorkerThread = class (TThread)
      private
        _iUid:  Integer;
        _swWatch:  TStopwatch;
        _lRunning:  Boolean;
    
        _sFileJSonData:  String;
        _fJsonData:  TextFile;
    
      protected
        constructor Create (AUid: Integer);
        procedure Execute; override;
    
      published
        property Running: Boolean read _lRunning;
    
      end;
    
      TConverter = class (TObject)
      private
        _swWatch0, _swWatch1, _swWatch2:  TStopwatch;
    
        _dwsjvData:  TdwsJSONValue;
    
      protected
        constructor Create;
        destructor Destroy; override;
    
        function Calculate (AUid: Integer; AJSonData: String; var AParse, ADestroy: Integer): Integer;
      end;
    
    const
      MAX_THREADS = 16;
    
    var
      iHowMany:  Integer;
      athWorker:  array [1..MAX_THREADS] of Pointer;
      aiElapsed:  array [1..MAX_THREADS] of Integer;
      aiElapsedParse:  array [1..MAX_THREADS] of Integer;
      aiElapsedDestroy:  array [1..MAX_THREADS] of Integer;
      aiFares:  array [1..MAX_THREADS] of Integer;
      swWatchT, swWatchP:  TStopwatch;
    
    
    constructor TWorkerThread.Create (AUid: Integer);
    begin
      inherited Create (True);
    
      _iUid := AUid;
      _swWatch := TStopwatch.Create;
      _sFileJSonData := ExtractFilePath (ParamStr (0)) + 'json' + Trim (IntToStr (_iUid)) + '.dat';
    
      _lRunning := False;
    
      Suspended := False;
    end;
    
    procedure TWorkerThread.Execute;
    var
      j:  Integer;
      sLine:  String;
      slLines:  TStringList;
    
      oS:  TConverter;
    begin
      _lRunning := True;
    
      oS := TConverter.Create;
    
      slLines := TStringList.Create;
      System.AssignFile (_fJsonData, _sFileJSonData);
      System.Reset (_fJsonData);
      j := 0;
      repeat
        System.Readln (_fJsonData, sLine);
        slLines.Add (sLine);
        Inc (j);
      until (j = 50);
    //  until (System.Eof (_fJsonData));
      System.Close (_fJsonData);
    
      Sleep (1000);
    
      _swWatch.Reset;
      _swWatch.Start;
    
      aiFares [_iUid] := 0;
      aiElapsedParse [_iUid] := 0;
      aiElapsedDestroy [_iUid] := 0;
      for j := 1 to slLines.Count do
        aiFares [_iUid] := aiFares [_iUid] + oS.Calculate (_iUid, slLines.Strings [j - 1], aiElapsedParse [_iUid], aiElapsedDestroy [_iUid]);
    
      _swWatch.Stop;
    
      slLines.Free;
      os.Destroy;
    
      aiElapsed [_iUid] := _swWatch.ElapsedMilliseconds;
    
      _lRunning := False;
    end;
    
    constructor TConverter.Create;
    begin
      inherited Create;
    
      _swWatch0 := TStopwatch.Create;
      _swWatch1 := TStopwatch.Create;
      _swWatch2 := TStopwatch.Create;
    end;
    
    destructor TConverter.Destroy;
    begin
      inherited;
    end;
    
    function TConverter.Calculate (AUid: Integer; AJSonData: String; var AParse, ADestroy: Integer): Integer;
    var
      jFare, jTotalFares, iElapsedParse, iElapsedDestroy, iElapsedTotal:  Integer;
    begin
      _swWatch0.Reset;
      _swWatch0.Start;
    
      _swWatch1.Reset;
      _swWatch1.Start;
      _dwsjvData := TdwsJSONValue.ParseString (AJSonData);
      _swWatch1.Stop;
      iElapsedParse := _swWatch1.ElapsedMilliseconds;
    
      if (_dwsjvData.ValueType = jvtArray) then
      begin
        _swWatch2.Reset;
        _swWatch2.Start;
    
        jTotalFares := _dwsjvData.ElementCount;
        for jFare := 0 to (jTotalFares - 1) do
          if (_dwsjvData.Elements [jFare].ValueType = jvtObject) then
          begin
    
            _swWatch1.Reset;
            _swWatch1.Start;
    
            _swWatch1.Stop;
          end;
      end;
    
      _swWatch1.Reset;
      _swWatch1.Start;
      _dwsjvData.Destroy;
      _swWatch1.Stop;
      iElapsedDestroy := _swWatch1.ElapsedMilliseconds;
    
      _swWatch0.Stop;
      iElapsedTotal := _swWatch0.ElapsedMilliseconds;
    
      Inc (AParse, iElapsedParse);
      Inc (ADestroy, iElapsedDestroy);
    
      result := jTotalFares;
    end;
    
    procedure MultithreadStart;
    var
      j:  Integer;
    begin
      for j := 1 to iHowMany do
        if (athWorker [j] = nil) then
        begin
          athWorker [j] := TWorkerThread.Create (j);
    
          TWorkerThread (athWorker [j]).FreeOnTerminate := False;
          TWorkerThread (athWorker [j]).Priority := tpNormal;
        end;
    end;
    
    procedure MultithreadStop;
    var
      j:  Integer;
    begin
      for j := 1 to MAX_THREADS do
        if (athWorker [j] <> nil) then
        begin
          TWorkerThread (athWorker [j]).Terminate;
          TWorkerThread (athWorker [j]).WaitFor;
    
          TWorkerThread (athWorker [j]).Free;
          athWorker [j] := nil;
        end;
    end;
    
    procedure Prologue;
    var
      j:  Integer;
    begin
      iHowMany := StrToInt (ParamStr (1));
    
      for j := 1 to MAX_THREADS do
        athWorker [j] := nil;
    
      swWatchT := TStopwatch.Create;
      swWatchT.Reset;
    
      swWatchP := TStopwatch.Create;
      swWatchP.Reset;
    end;
    
    procedure RunConvert;
    
      function __IsRunning: Boolean;
      var
        j:  Integer;
      begin
        result := False;
        for j := 1 to MAX_THREADS do
          result := result or ((athWorker [j] <> nil) and TWorkerThread (athWorker [j]).Running);
      end;
    
    begin
    
      swWatchT.Start;
    
      MultithreadStart;
    
      Sleep (1000);
      while (__isRunning) do
        Sleep (500);
    
      MultithreadStop;
    
      swWatchT.Stop;
      Writeln (#13#10, 'Total time:', swWatchT.ElapsedMilliseconds);
    end;
    
    procedure Epilogue;
    var
      j:  Integer;
    begin
      for j := 1 to iHowMany do
        Writeln ( #13#10, 'Thread # ', j, '  tot.time:', aiElapsed [j], '  fares:', aiFares [j], '  tot.parse:', aiElapsedParse [j], '  tot.destroy:', aiElapsedDestroy [j]);
    
      Readln;
    end;
    
    begin
      try
        Prologue;
        RunConvert;
        Epilogue;
    
      except
        on E: Exception do
          Writeln (E.ClassName, ': ', E.Message);
      end;
    end.
    

3 个答案:

答案 0 :(得分:1)

您是否尝试过我的可扩展内存管理器?因为Delphi(内部使用fastmm)不能很好地扩展字符串和其他与内存相关的东西: https://scalemm.googlecode.com/files/ScaleMM_v2_4_1.zip

你也可以尝试我的探查器的两种探查器模式,看看哪个部分是瓶颈: https://code.google.com/p/asmprofiler/

答案 1 :(得分:1)

该解决方案是具有英特尔®线程构建模块内存管理器的交换默认Delphi XE2或XE7内存管理器。在示例应用程序中,它缩放ca.当app为64位时,线性数量最多为16个。

update: with assumption that number of threads running is less than number of cores

这是在2cores / 4ht到12cores / 24ht的机器上测试的,运行KVM虚拟化Windows 7,内存为124GB

有趣的是虚拟化Win 7.内存分配和释放比原生Win 7快2倍。

结论:如果在多线程(超过4-8个线程)应用程序的线程中对10kB-10MB块执行大量内存分配/解除分配操作 - 只使用Intel的内存管理器。

@André:感谢小费指向我正确的方向!

这是TBB内存管理器用于测试的单元,它必须在主项目文件中的单元列表中显示为第1个.dpr

unit TBBMem;

interface

function  ScalableGetMem  (ASize: NativeInt): Pointer; cdecl; external 'tbbmalloc' name 'scalable_malloc';
procedure ScalableFreeMem (APtr: Pointer); cdecl; external 'tbbmalloc' name 'scalable_free';
function  ScalableReAlloc (APtr: Pointer; Size: NativeInt): Pointer; cdecl; external 'tbbmalloc' name 'scalable_realloc';

implementation

Function TBBGetMem (ASize: Integer): Pointer;
begin
  result := ScalableGetMem (ASize);
end;

Function TBBFreeMem (APtr: Pointer): Integer;
begin
  ScalableFreeMem (APtr);
  result := 0;
end;

Function TBBReAllocMem (APtr: Pointer; ASize: Integer): Pointer;
begin
  result := ScalableRealloc (APtr, ASize);
end;

const
  TBBMemoryManager:  TMemoryManager = ( GetMem: TBBGetmem;
                                        FreeMem: TBBFreeMem;
                                        ReAllocMem:  TBBReAllocMem; );
var
  oldMemoryManager:  TMemoryManager;

initialization
  GetMemoryManager (oldMemoryManager);
  SetMemoryManager (TBBMemoryManager);

finalization
  SetMemoryManager (oldMemoryManager);

end.

答案 2 :(得分:1)

我对FastCode MM Challenge进行了(重新)测试,结果对于TBB来说并不是那么好(在块缩小测试中也是内存异常)。

简而言之:在这个复杂的测试中,ScaleMM2和Google TCmalloc是最快的,Fastmm和ScaleMM2使用的内存最少。

Average Speed Performance: (Scaled so that the winner = 100%)
  XE6         :   70,4
  TCmalloc    :   89,1
  ScaleMem2   :  100,0
  TBBMem      :   77,8

Average Memory Performance: (Scaled so that the winner = 100%)
  XE6         :  100,0
  TCmalloc    :   29,6
  ScaleMem2   :   75,6
  TBBMem      :   38,4

FastCode挑战:https://code.google.com/p/scalemm/source/browse/#svn%2Ftrunk%2FChallenge
TBB 4.3:https://www.threadingbuildingblocks.org/download