Question

I built the following regular expression with RegexPal: <a.*?href="([^"]*)"

I call it using the built-in XE4 unit "RegularExpressions":

  Matches := TRegex.Create('<a.*?href="([^"]*)"').Matches(PageSource);

The goal is to extract all links on a page from its HTML source, so that I can display them in a TListView without using a TWebView component.

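As a first example, the page source for the link "http://www.splitbrain.org/_static/ico/farmfresh/" is about 142 KB. When I run the code below, peak memory is about 530 MB. Quite heavy, but it works. It yields a match list of about 1400 items.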

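As a second example, the page source for the link "http://www.splitbrain.org/_static/ico/fugue/" is about 338 KB. When I run the code below, peak memory climbs to about 1.7 GB and then an "Out of memory" exception is thrown. Clearly the straightforward approach does not work for larger pages.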

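I realize I could read the page source line by line and run the regular expression against each line. I suspect that might cost some performance, but at least peak memory should be much lower.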
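A minimal sketch of the line-by-line idea, to make it concrete. It assumes the page source is already in a string, splits it with a TStringList, and walks the matches on each line with TRegEx.Match and TMatch.NextMatch instead of building a TMatchCollection. ExtractLinksByLine and the sample input are hypothetical and not part of the code in the question; whether this actually lowers peak memory in XE4 has not been measured here.

```pascal
program LineByLineLinks;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.RegularExpressions;

// Hypothetical helper, not part of the code in the question: runs the
// href regex against one line at a time instead of the whole page.
procedure ExtractLinksByLine(const PageSource: string; Links: TStrings);
var
  Lines: TStringList;
  Regex: TRegEx;
  Match: TMatch;
  I: Integer;
begin
  Regex := TRegEx.Create('<a.*?href="([^"]*)"');
  Lines := TStringList.Create;
  try
    Lines.Text := PageSource; // split the source at its line breaks
    for I := 0 to Lines.Count - 1 do
    begin
      // Walk the matches one by one rather than collecting them all first.
      Match := Regex.Match(Lines[I]);
      while Match.Success do
      begin
        Links.Add(Match.Groups[1].Value);
        Match := Match.NextMatch;
      end;
    end;
  finally
    Lines.Free;
  end;
end;

var
  Links: TStringList;
  Link: string;
begin
  Links := TStringList.Create;
  try
    // Made-up two-line sample input, for illustration only.
    ExtractLinksByLine(
      '<a href="a.png">a</a> <a class="x" href="b.png">b</a>' + sLineBreak +
      '<a href="c.png">c</a>', Links);
    for Link in Links do
      Writeln(Link);
  finally
    Links.Free;
  end;
end.
```

One caveat with this approach: an anchor tag whose attributes are split across two lines would be missed, because each line is matched in isolation.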

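What I would like to know is: is TRegex really suited to analyzing this kind of data? I have seen several reports of unresolved bugs in TRegex. (Sorry for not citing a direct source, but I am still limited to 2 links. Long-time reader, first-time poster as of today.)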

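If it is not (which seems to be the case), what is the best alternative in terms of speed/performance and lower peak memory usage? I found that PCRE might be an option, but I would like to limit external libraries as much as possible. If I were to include PCRE, could that be done with minimal code changes? (For example, are the regular expressions compatible?)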
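For context on the PCRE question: the TRegEx record in Delphi's RTL is itself a wrapper around TPerlRegEx (unit System.RegularExpressionsCore), which in turn wraps PCRE, so the pattern syntax is already PCRE syntax. A rough sketch of using TPerlRegEx directly with the same pattern follows. ExtractLinks and the sample input are hypothetical, the string type of Subject and Groups differs between Delphi versions (so implicit conversions may occur), and any effect on peak memory would need to be measured.

```pascal
program PerlRegExLinks;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes, System.RegularExpressionsCore;

// Hypothetical helper, not part of the code in the question: iterates
// over the matches with TPerlRegEx, keeping only the captured href values.
procedure ExtractLinks(const PageSource: string; Links: TStrings);
var
  Regex: TPerlRegEx;
begin
  Regex := TPerlRegEx.Create;
  try
    Regex.RegEx := '<a.*?href="([^"]*)"'; // same pattern as with TRegEx
    Regex.Subject := PageSource;
    if Regex.Match then
      repeat
        Links.Add(Regex.Groups[1]); // capture group 1 = the href value
      until not Regex.MatchAgain;
  finally
    Regex.Free;
  end;
end;

var
  Links: TStringList;
  Link: string;
begin
  Links := TStringList.Create;
  try
    // Made-up sample input, for illustration only.
    ExtractLinks('<a href="a.png">a</a> <a href="b.png">b</a>', Links);
    for Link in Links do
      Writeln(Link);
  finally
    Links.Free;
  end;
end.
```

This keeps a single subject string and one regex object alive for the whole scan, and it needs no library beyond what ships with Delphi.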

Example code:

function TFrmMain.FGetURLSourceAsString(const aURL: string; Depth: Integer): string;
var
  Matches: TMatchCollection;
  Url: String;
begin
  // Set UserAgent. This is needed to prevent the following error: "HTTP/1.1 403 Forbidden."
  lHTTP.Request.UserAgent := 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon)';

  // Unhandled redirects will cause a 301 error. See: http://stackoverflow.com/questions/4549809/indy-idhttp-how-to-handle-page-redirects
  lHTTP.HandleRedirects := True;
  lHTTP.RedirectMaximum := 35;

  // todo: we don't actually support https yet; it needs an IOHandler

  // If url has no http in the front, add it. Otherwise indy will complain about "unknown protocol".
  if AnsiPos('http', aURL) = 0 then
    Result := lHTTP.Get('http://' + aURL)
  else
    Result := lHTTP.Get(aURL);

  //Analyze for possible meta refreshing:
  //Example: <meta http-equiv="refresh" content="1;url=http://urlhere">
  Matches := TRegex.Create('<meta.*?content=.*?url=([^"]*)"').Matches(Result);
  if (Matches.Count > 0) and (Depth < 5) then begin
    Url := Matches.Item[0].Groups[1].Value;
    Result := FGetURLSourceAsString(Url, Depth+1);
  end else begin
    //if Depth >= 5 then
    //todo message max depth reached
    //Just return Result as is
  end;
end;

procedure TFrmMain.BtnLinksClick(Sender: TObject);
var
  PageSource: String;
  Matches: TMatchCollection;
begin
  LvResultSpeeds.Clear();
  PageSource := FGetURLSourceAsString(EditURL.Text, 0);

  //todo this quickly jumps to 1.7 GB memory usage on the splitbrain url
  Matches := TRegex.Create('<a.*?href="([^"]*)"').Matches(PageSource);
end;

Using TRegex in Delphi on long inputs causes an out-of-memory error

0 answers: