Question

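Does anyone know how to count, or have code that counts, the number of unique phrases in a document? (Single words, two-word phrases, three-word phrases.)

Thanks

An example of what I am looking for: I have a text document and I need to see which word phrases are the most popular. Sample text:

I took the car to the car wash.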

I : 1
took : 1
the : 2
car : 2
to : 1
wash : 1
I took : 1
took the : 1
the car : 2
car to : 1
to the : 1
car wash : 1
I took the : 1
took the car : 1
the car to : 1
car to the : 1
to the car : 1
the car wash : 1
I took the car to : 1
took the car to the : 1
the car to the car : 1
car to the car wash : 1

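I need the phrase and the number of times it appears.

Any help would be greatly appreciated. The closest thing I have found is a PHP script from http://tools.seobook.com/general/keyword-density/source.php

I used to have some code for this, but I can't find it.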

Answer 1

Here is some initial code that solves your problem.

// Returns Counts filled with every word sequence found in s. The number of
// occurrences of each sequence is stored as an integer in its Objects slot.
// If Counts is nil, a new list is created and the caller must free it.
function CountWordSequences(const s:string; Counts:TStrings = nil):TStrings;
var
  words, seqs : TStrings;
  nw,i,j:integer;
  t :string;
begin
  if Counts=nil then Counts:=TStringList.Create;
  words:=TStringList.Create;
  seqs:=TStringList.Create;
  try
    words.DelimitedText:=s;           // build a list of all words
    for nw:=1 to words.Count do       // build a list of all word sequences
     begin
      for i:=0 to words.Count-nw do
       begin
        t:='';
        for j:=0 to nw-1 do
         begin
          t:=t+words[i+j];
          if j<>nw-1 then t:=t+' ';
         end;
        seqs.Add(t);
       end;
     end;
    for i:=0 to seqs.Count-1 do       // count repeated sequences
     begin
      j:=Counts.IndexOf(seqs.Strings[i]);
      if j=-1 then
        Counts.AddObject(seqs.Strings[i],TObject(1))
      else
        Counts.Objects[j] := TObject(Succ(Integer(Counts.Objects[j])));
     end;
  finally
    seqs.Free;                        // Free instead of calling Destroy directly
    words.Free;
  end;
  result:=Counts;
end;

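You will need to elaborate on this code for real-world production use, for example by recognizing more word delimiters (not just spaces) and by implementing some form of case insensitivity.

To test it, place a Button, an Edit field and a Memo on a form and add the following code.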

procedure TForm1.Button1Click(Sender: TObject);
var i:integer; l:TStrings;
 begin
  l:=CountWordSequences(edit1.Text,TStringList.Create);
  try
    for i:=1 to l.count do
      memo1.Lines.Add('"'+l.Strings[i-1]+'": '+inttostr(Integer(l.Objects[i-1])));
  finally
    l.Free;   // the result list is owned by the caller
  end;
 end;

I first tried it with I took the car to the car wash.

which gives

"I": 1
"took": 1
"the": 2
"car": 2
"to": 1
"wash.": 1
"I took": 1
"took the": 1
"the car": 2
"car to": 1
"to the": 1
"car wash.": 1
"I took the": 1
"took the car": 1
"the car to": 1
"car to the": 1
"to the car": 1
"the car wash.": 1
"I took the car": 1
"took the car to": 1
"the car to the": 1
"car to the car": 1
"to the car wash.": 1
"I took the car to": 1
"took the car to the": 1
"the car to the car": 1
"car to the car wash.": 1
"I took the car to the": 1
"took the car to the car": 1
"the car to the car wash.": 1
"I took the car to the car": 1
"took the car to the car wash.": 1
"I took the car to the car wash.": 1

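The trailing period in "wash." above shows why the input needs cleaning before counting. The following is a minimal sketch of one way to do that, not part of the answer's code: NormalizeText is a hypothetical helper that lower-cases the text and turns every character other than an ASCII letter, digit or apostrophe into a space. It assumes ASCII input; accented or non-Latin letters would be treated as separators.

```pascal
program NormalizeDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: lower-cases s and replaces everything that is not
// an ASCII letter, digit or apostrophe with a space.
function NormalizeText(const s: string): string;
var
  i: Integer;
begin
  Result := LowerCase(s);             // LowerCase only affects 'A'..'Z'
  for i := 1 to Length(Result) do
    case Result[i] of
      '''', '0'..'9', 'a'..'z': ;     // word character: keep it
    else
      Result[i] := ' ';               // anything else becomes a separator
    end;
end;

begin
  // Prints the sentence in lower case, with a space where the period was.
  WriteLn(NormalizeText('I took the car to the Car Wash.'));
end.
```

Passing NormalizeText(edit1.Text) instead of edit1.Text to CountWordSequences should then count "wash." and "Wash" as the same word.
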
Answer 2

From the Delphi Basics website.

var
  position : Integer;

begin
  // Look for the word 'Cat' in a sentence
  // Note : that this search is case sensitive, so that
  //        the first 'cat' is not matched
  position := AnsiPos('Cat', 'The cat sat on the Cat mat');
  if position = 0
  then ShowMessage('''Cat'' not found in the sentence')
  else ShowMessage('''Cat'' was found at character '+IntToStr(position));
end;

Maybe this will help.
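
To connect this to the question: a position search can be turned into a count by repeating it from just past each match. The sketch below uses PosEx from StrUtils for that; CountOccurrences is a hypothetical name. It counts non-overlapping, case-sensitive substring matches, so 'car' would also match inside 'cart'.

```pascal
program CountPhraseDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

// Hypothetical helper: number of non-overlapping occurrences of Phrase in s.
function CountOccurrences(const Phrase, s: string): Integer;
var
  p: Integer;
begin
  Result := 0;
  p := PosEx(Phrase, s, 1);                      // first match, 0 if none
  while p > 0 do
  begin
    Inc(Result);
    p := PosEx(Phrase, s, p + Length(Phrase));   // continue after this match
  end;
end;

begin
  WriteLn(CountOccurrences('the car', 'I took the car to the car wash'));  // 2
  WriteLn(CountOccurrences('Cat', 'The cat sat on the Cat mat'));          // 1
end.
```

This only works when the phrase is already known; finding the phrases themselves still needs one of the counting approaches in the other answers.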

Answer 3

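The number of possible combinations grows quickly. Assuming about 30000 words are in mainstream use in a language, the number of three-word combinations is on the order of 30000^3 (about 2.7 x 10^13).

Anyway, a zero-order implementation would be to build a (hashed) word list, filtered if necessary to drop very common words and the like, in order to reduce the number of phrases. Other things you may want to do are reducing plurals to singular, stripping trailing characters, normalizing case, and so on.

Then walk through the text word by word (tokenizer style), skipping the common words, and simply keep an ordered list of the phrases you encounter, and hope you don't run out of memory, since Delphi has no 64-bit version :)

Doesn't Knuth have a whole book about combinatorics?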
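
A rough sketch of that idea, assuming a Delphi version that ships Generics.Collections (2009 or later): phrases are counted in a TDictionary hash map, and common words are dropped before phrases are formed. IsStopWord and its three-word list are placeholders for illustration, and the split on whitespace is deliberately naive.

```pascal
program PhraseHashDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, Generics.Collections;

// Placeholder stop-word test; a real list would be much longer.
function IsStopWord(const w: string): Boolean;
begin
  Result := (w = 'the') or (w = 'a') or (w = 'to');
end;

// Counts phrases of 1..MaxWords consecutive words in a hash map.
// Stop words are removed first, so phrases span the gaps they leave.
procedure CountPhrases(const s: string; MaxWords: Integer;
  Counts: TDictionary<string, Integer>);
var
  all, words: TStringList;
  i, j, n, c: Integer;
  phrase: string;
begin
  all := TStringList.Create;
  words := TStringList.Create;
  try
    all.Delimiter := ' ';
    all.DelimitedText := LowerCase(s);        // naive split on whitespace
    for i := 0 to all.Count - 1 do
      if not IsStopWord(all[i]) then
        words.Add(all[i]);
    for n := 1 to MaxWords do
      for i := 0 to words.Count - n do
      begin
        phrase := words[i];
        for j := 1 to n - 1 do
          phrase := phrase + ' ' + words[i + j];
        if Counts.TryGetValue(phrase, c) then
          Counts[phrase] := c + 1             // seen before: increment
        else
          Counts.Add(phrase, 1);              // first occurrence
      end;
  finally
    words.Free;
    all.Free;
  end;
end;

var
  Counts: TDictionary<string, Integer>;
  Key: string;
begin
  Counts := TDictionary<string, Integer>.Create;
  try
    CountPhrases('I took the car to the car wash', 3, Counts);
    for Key in Counts.Keys do                 // iteration order is unspecified
      WriteLn(Key, ' : ', Counts[Key]);
  finally
    Counts.Free;
  end;
end.
```

One side effect worth noting: because "to" and "the" are dropped, the sample sentence yields the phrase "car car", which never appears literally in the text. Whether that is acceptable depends on what the counts are used for.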

Answer 4

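This is how I would approach the problem. Assume that each pass through the data file creates a new data file for the next step. The control character mentioned below can be any character that would not naturally occur in the data. When writing control characters, do not write duplicates.

Run through your records and tally each word individually.
Run through your records again and replace any word that was used only once with a control character, adding any pairs that occur to a new list (words A B C become item A B and item B C). The control character acts as a hard separator. Any single word between control characters should also be converted, since it cannot become part of a pair.
Run through your file again and replace any pair that was used only once with a control character, adding any triplets that occur to a new list. Convert pairs between control characters to control characters.

Repeat, adding another word level for each list, until you get an empty list or reach the maximum phrase length you want to support.

This approach means that your most common phrases will never contain smaller phrases that are used less often.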
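
The passes above can be approximated in memory. The sketch below is a simplified variant, not the file-based procedure itself: instead of writing control characters, it keeps a Boolean flag per word position and only counts a phrase of n+1 words when both of its n-word sub-phrases occurred more than once. CountPruned and Phrase are hypothetical names, and Generics.Collections (Delphi 2009 or later) is assumed.

```pascal
program PrunedPhraseDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, Generics.Collections;

// Joins n consecutive words, starting at index i, into one phrase.
function Phrase(words: TStrings; i, n: Integer): string;
var
  j: Integer;
begin
  Result := words[i];
  for j := 1 to n - 1 do
    Result := Result + ' ' + words[i + j];
end;

// Level-wise counting with pruning. Each counted phrase is appended to
// Report as 'phrase : count'.
procedure CountPruned(words: TStrings; MaxWords: Integer; Report: TStrings);
var
  counts: TDictionary<string, Integer>;
  alive: array of Boolean;   // alive[i]: a candidate phrase starts at word i
  i, n, c: Integer;
  p: string;
begin
  SetLength(alive, words.Count);
  for i := 0 to words.Count - 1 do
    alive[i] := True;                        // every single word is a candidate
  counts := TDictionary<string, Integer>.Create;
  try
    for n := 1 to MaxWords do
    begin
      counts.Clear;
      for i := 0 to words.Count - n do       // count this level's candidates
        if alive[i] then
        begin
          p := Phrase(words, i, n);
          if counts.TryGetValue(p, c) then
            counts[p] := c + 1
          else
            counts.Add(p, 1);
        end;
      if counts.Count = 0 then
        Break;                               // empty list: stop
      for p in counts.Keys do
        Report.Add(p + ' : ' + IntToStr(counts[p]));
      // Keep only positions whose phrase occurred more than once ...
      for i := 0 to words.Count - 1 do
        alive[i] := (i <= words.Count - n) and alive[i]
          and (counts[Phrase(words, i, n)] > 1);
      // ... and extend a position only if its right-hand neighbour survived.
      for i := 0 to words.Count - 2 do
        alive[i] := alive[i] and alive[i + 1];
      if words.Count > 0 then
        alive[words.Count - 1] := False;
    end;
  finally
    counts.Free;
  end;
end;

var
  Words, Report: TStringList;
begin
  Words := TStringList.Create;
  Report := TStringList.Create;
  try
    Words.Delimiter := ' ';
    Words.DelimitedText := 'I took the car to the car wash';
    CountPruned(Words, 5, Report);
    WriteLn(Report.Text);
  finally
    Report.Free;
    Words.Free;
  end;
end.
```

For the sample sentence, only "the" and "car" occur more than once, so the only multi-word phrase that gets counted is "the car" with a count of 2, and the loop stops at the three-word level because no candidates remain.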

Delphi phrase count / keyword density