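Does anyone know how to count the number of unique phrases in a document, or have code that does it? (Single words, two-word phrases, three-word phrases.)
Thanks.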
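An example of what I am looking for:
I have a text document, and I need to see which word phrases are the most common. Example text:
I took the car to the car wash.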
I : 1
took : 1
the : 2
car : 2
to : 1
wash : 1
I took : 1
took the : 1
the car : 2
car to : 1
to the : 1
car wash : 1
I took the : 1
took the car : 1
the car to : 1
car to the : 1
to the car : 1
the car wash : 1
I took the car to : 1
took the car to the : 1
the car to the car : 1
car to the car wash : 1
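I need each phrase, along with the number of times it appears.
Any help would be greatly appreciated. The closest thing I have found is a PHP script from http://tools.seobook.com/general/keyword-density/source.php
I used to have some code for this, but I can't find it.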
Answer 0 (score: 2)
Here is some initial code that solves your problem.
function CountWordSequences(const s: string; Counts: TStrings = nil): TStrings;
var
  words, seqs: TStrings;
  nw, i, j: Integer;
  t: string;
begin
  // if no list is passed in, create one; the caller owns the result
  if Counts = nil then
    Counts := TStringList.Create;
  words := TStringList.Create;
  seqs := TStringList.Create;
  try
    // build a list of all words
    words.DelimitedText := s;
    // build a list of all word sequences, from 1 word up to the whole text
    for nw := 1 to words.Count do
      for i := 0 to words.Count - nw do
      begin
        t := '';
        for j := 0 to nw - 1 do
        begin
          t := t + words[i + j];
          if j <> nw - 1 then
            t := t + ' ';
        end;
        seqs.Add(t);
      end;
    // count repeated sequences; the count is stored in the Objects slot,
    // cast to TObject
    for i := 0 to seqs.Count - 1 do
    begin
      j := Counts.IndexOf(seqs[i]);
      if j = -1 then
        Counts.AddObject(seqs[i], TObject(1))
      else
        Counts.Objects[j] := TObject(Succ(Integer(Counts.Objects[j])));
    end;
  finally
    // Free instead of Destroy, and release the lists even on an exception
    seqs.Free;
    words.Free;
  end;
  Result := Counts;
end;
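You will need to elaborate on this code for real-world production use, for example by recognizing more word separators (not only spaces) and by handling case-insensitivity in some way.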
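The two refinements mentioned above could be sketched roughly like this. It is only a minimal illustration, not part of the original answer: `IsWordChar` and `SplitWords` are hypothetical helper names, and the sketch assumes plain ASCII text, so accented letters would be treated as separators.

```pascal
program TokenizeSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: True for ASCII letters, digits and the apostrophe.
// Everything else (blanks, punctuation, hyphens) counts as a separator.
function IsWordChar(C: Char): Boolean;
begin
  Result := ((C >= 'a') and (C <= 'z')) or
            ((C >= 'A') and (C <= 'Z')) or
            ((C >= '0') and (C <= '9')) or
            (C = '''');
end;

// Hypothetical helper: appends the lower-cased words of S to Words.
procedure SplitWords(const S: string; Words: TStrings);
var
  i, start: Integer;
begin
  i := 1;
  while i <= Length(S) do
  begin
    // skip any run of separators
    while (i <= Length(S)) and not IsWordChar(S[i]) do
      Inc(i);
    start := i;
    // consume one word
    while (i <= Length(S)) and IsWordChar(S[i]) do
      Inc(i);
    if i > start then
      Words.Add(LowerCase(Copy(S, start, i - start)));
  end;
end;

var
  Words: TStringList;
  k: Integer;
begin
  Words := TStringList.Create;
  try
    SplitWords('I took the car, to the Car-Wash.', Words);
    // prints one word per line: i took the car to the car wash
    for k := 0 to Words.Count - 1 do
      WriteLn(Words[k]);
  finally
    Words.Free;
  end;
end.
```

In `CountWordSequences`, a call such as `SplitWords(s, words)` could then take the place of the `DelimitedText` assignment, so that "wash." and "Wash" end up as the same word.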
To test it, place a Button, an Edit and a Memo on a form, and add the following code.
procedure TForm1.Button1Click(Sender: TObject);
var
  i: Integer;
  l: TStrings;
begin
  l := CountWordSequences(Edit1.Text, TStringList.Create);
  try
    // one line per phrase: "phrase": count
    for i := 0 to l.Count - 1 do
      Memo1.Lines.Add('"' + l[i] + '": ' + IntToStr(Integer(l.Objects[i])));
  finally
    l.Free; // the caller owns the returned list
  end;
end;
I first tried it with the input "I took the car to the car wash." (including the final period),
which gives
"I": 1
"took": 1
"the": 2
"car": 2
"to": 1
"wash.": 1
"I took": 1
"took the": 1
"the car": 2
"car to": 1
"to the": 1
"car wash.": 1
"I took the": 1
"took the car": 1
"the car to": 1
"car to the": 1
"to the car": 1
"the car wash.": 1
"I took the car": 1
"took the car to": 1
"the car to the": 1
"car to the car": 1
"to the car wash.": 1
"I took the car to": 1
"took the car to the": 1
"the car to the car": 1
"car to the car wash.": 1
"I took the car to the": 1
"took the car to the car": 1
"the car to the car wash.": 1
"I took the car to the car": 1
"took the car to the car wash.": 1
"I took the car to the car wash.": 1
Answer 1 (score: 0)
From the Delphi Basics website.
var
  position : Integer;
begin
  // Look for the word 'Cat' in a sentence.
  // Note: this search is case sensitive, so
  // the first 'cat' is not matched.
  position := AnsiPos('Cat', 'The cat sat on the Cat mat');
  if position = 0
  then ShowMessage('''Cat'' not found in the sentence')
  else ShowMessage('''Cat'' was found at character ' + IntToStr(position));
end;
Maybe this will help.
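AnsiPos only reports the first match, so to count how often one particular phrase occurs you would have to keep searching from the end of each hit. A rough sketch of that idea follows, using PosEx from StrUtils. `CountOccurrences` is a hypothetical name, not something from Delphi Basics, and the sketch matches plain substrings, so 'cat' would also be counted inside 'category'.

```pascal
program CountPhraseSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, StrUtils;

// Hypothetical helper: counts non-overlapping occurrences of Phrase in
// Source, ignoring ASCII case. Substring match only, no word boundaries.
function CountOccurrences(const Phrase, Source: string): Integer;
var
  p: Integer;
  LowPhrase, LowSource: string;
begin
  Result := 0;
  if Phrase = '' then
    Exit;
  LowPhrase := LowerCase(Phrase);
  LowSource := LowerCase(Source);
  p := PosEx(LowPhrase, LowSource, 1);
  while p > 0 do
  begin
    Inc(Result);
    // continue searching just past the current hit
    p := PosEx(LowPhrase, LowSource, p + Length(LowPhrase));
  end;
end;

begin
  // prints 2: both 'cat' and 'Cat' are counted
  WriteLn(CountOccurrences('cat', 'The cat sat on the Cat mat'));
end.
```

This only answers "how often does this one phrase occur"; it does not discover which phrases exist, which is what the question asks for.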
Answer 2 (score: 0)
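The number of possible combinations grows very quickly. Assuming about 30,000 words are in mainstream use in a language, the number of possible three-word phrases is on the order of 30000^3, roughly 2.7 × 10^13.
Anyway, a zeroth-order implementation would be to build a (hashed) word list, filtering out very common words if necessary to reduce the number of phrases. Other things you might want to do are reducing plurals to the singular, stripping trailing characters, normalizing case, etc.
Then pass over the text word by word (tokenizer style), skipping the common words, and simply keep an ordered list of the phrases you encounter, and hope you don't run out of memory, since Delphi has no 64-bit version :)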
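A rough sketch of that pass might look like the following. It is not the hashed list described above: to stay within the classic RTL it collects the phrases in a TStringList, sorts it, and counts runs of identical entries instead. `CountPhrases` is a hypothetical helper, the stop-word list is purely illustrative, and the sketch assumes blank-separated ASCII words with no quote characters.

```pascal
program PhraseCountSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: counts phrases of 1..MaxWords words in S, skipping the
// words listed in Stop, and appends "phrase : count" lines to Report.
procedure CountPhrases(const S: string; Stop: TStrings; MaxWords: Integer;
  Report: TStrings);
var
  Words, Phrases: TStringList;
  n, i, j, runLen: Integer;
  t: string;
begin
  Words := TStringList.Create;
  Phrases := TStringList.Create;
  try
    // tokenize on blanks after lower-casing (ASCII only)
    Words.Delimiter := ' ';
    Words.DelimitedText := LowerCase(S);
    // drop the common words
    for i := Words.Count - 1 downto 0 do
      if Stop.IndexOf(Words[i]) >= 0 then
        Words.Delete(i);
    // collect every phrase of 1..MaxWords consecutive remaining words
    for n := 1 to MaxWords do
      for i := 0 to Words.Count - n do
      begin
        t := Words[i];
        for j := 1 to n - 1 do
          t := t + ' ' + Words[i + j];
        Phrases.Add(t);
      end;
    // sorting puts identical phrases next to each other, so each run of
    // equal entries is one phrase and the run length is its count
    Phrases.Sort;
    i := 0;
    while i < Phrases.Count do
    begin
      runLen := 1;
      while (i + runLen < Phrases.Count) and
            (Phrases[i + runLen] = Phrases[i]) do
        Inc(runLen);
      Report.Add(Phrases[i] + ' : ' + IntToStr(runLen));
      Inc(i, runLen);
    end;
  finally
    Phrases.Free;
    Words.Free;
  end;
end;

var
  Stop, Report: TStringList;
  k: Integer;
begin
  Stop := TStringList.Create;
  Report := TStringList.Create;
  try
    Stop.CommaText := 'the,to,a'; // illustrative stop words only
    CountPhrases('I took the car to the car wash', Stop, 3, Report);
    for k := 0 to Report.Count - 1 do
      WriteLn(Report[k]);
  finally
    Report.Free;
    Stop.Free;
  end;
end.
```

Capping the phrase length at three words keeps the list at roughly three entries per input word, which is what makes the memory use manageable compared with generating every possible sequence.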
Doesn't Knuth have a whole book on combinatorics?
Answer 3 (score: 0)
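This is how I would approach the problem. Assume that each pass over the data file creates a new data file for the next step. The control character mentioned can be any character that does not occur naturally in the data. When writing control characters, do not write duplicates.
Repeat, adding another word level to each list, until you get an empty list or reach the maximum phrase length you want to support.
This approach means that your most common phrases will never contain smaller phrases that are used less often.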