Question

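My HTML source contains about 1000 tweets (one tweet per line). Most of the tweets look like the one below. I load them into a Delphi memo and have tried to strip the HTML tags using the Pos and Delete functions, but failed.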

<div id='tweetText'> RT <a onmousedown="return touch(this.href,0)" href="http://twitter.com/HighfashionUK">@HighfashionUK</a> RT: Surprise goody bag up 4 grabs, Ok. <a onmousedown="return touch(this.href,0)" href="http://plixi.com/p/57846587">http://plixi.com/p/57846587</a> when we get 150</div>

I would like to strip the HTML tags and keep only:

RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150

How can I extract this kind of text in Delphi?

Thank you very much.

Update:

Cosmin Prund is right. I skipped a part by mistake. What I want is:

RT @HighfashionUK  RT: Surprise goody bag up 4 grabs, Ok. http://plixi.com/p/57846587 when we get 150

Cosmin Prund is great.

Answer 1

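Since all HTML tags sit between < and >, a routine to strip the markup can be written as simply as this. Hopefully this is what you want, because, as you can see in my comment, there is a problem with @HighfashionUK: your example skips it, and I don't know why.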

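function StripHtmlMarkup(const source:string):string;
var i, count: Integer;
    InTag: Boolean;
    P: PChar;
begin
  // The output can never be longer than the input, so allocate once up front.
  SetLength(Result, Length(source));
  P := PChar(Result);   // zero-based pointer into the Result buffer
  InTag := False;       // True while we are between '<' and '>'
  count := 0;           // number of characters kept so far
  for i:=1 to Length(source) do
    if InTag then
      begin
        // Inside a tag: skip everything until the closing '>'.
        if source[i] = '>' then InTag := False;
      end
    else
      if source[i] = '<' then InTag := True
      else
        begin
          // Outside a tag: copy the character to the output.
          P[count] := source[i];
          Inc(count);
        end;
  // Trim the buffer down to the characters actually written.
  SetLength(Result, count);
end;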
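
Since the question has one tweet per line in a memo, the routine above could be applied line by line. The following is only a minimal sketch: `StripAllLines` is a hypothetical helper name, and `Memo1` in the usage line is an assumed TMemo on the form. It also assumes `Classes` and `SysUtils` are in the unit's uses clause. `Trim` is there because the sample line has a space right after the opening `<div>`, which would otherwise be left at the start of the result.

```delphi
// Hypothetical helper, not part of the RTL: runs StripHtmlMarkup (defined
// above) over every line of a TStrings, e.g. the Lines of a TMemo.
// Assumes Classes (TStrings) and SysUtils (Trim) are in the uses clause.
procedure StripAllLines(Lines: TStrings);
var
  i: Integer;
begin
  Lines.BeginUpdate;   // avoid repainting the control after every line
  try
    for i := 0 to Lines.Count - 1 do
      Lines[i] := Trim(StripHtmlMarkup(Lines[i]));
  finally
    Lines.EndUpdate;
  end;
end;

// Usage, assuming a memo named Memo1 that already holds the HTML:
//   StripAllLines(Memo1.Lines);
```

Note that this only removes the tags. Character entities such as `&amp;` are left as they are, so they would need a separate replacement step if the tweets contain them.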
