How to compute the similarity between two sentences (syntactic and semantic)

Time: 2010-09-07 03:41:36

Tags: matlab semantics text-mining

I need to take two sentences at a time and decide whether they are similar. By similar I mean both syntactically and semantically.

  

INPUT1: Obama signs the law. A new law is signed by Obama.

INPUT2: A Bus is stopped here. A vehicle stops here.

INPUT3: Fire in NY. NY is burnt down.

INPUT4: Fire in NY. 50 died in NY fire.

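I don't want to use an ontology tree as the solution. I wrote some code that computes the Levenshtein distance (LD) between the two sentences and then decides whether the second sentence

  • can be ignored (INPUT1 and INPUT2),
  • should replace the first sentence (INPUT3), or
  • should be stored together with the first sentence (INPUT4).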

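I'm not happy with the code, because LD only compares the strings at the syntactic (character) level. What other methods are there? And how can I bring in semantics (for example, that a bus is a kind of vehicle)?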

Here is the code:

%# As the difference is computed, a decision is made on whether the new event
%# (string 2) is ignored, replaces the existing event (string 1), or is
%# stored separately. The higher the LD metric, the greater the difference
%# between the two strings. A low distance indicates identical or similar
%# events; a high distance indicates that the new event is a fresh event.

%# Example inputs, assumed here for illustration only; the original snippet
%# expects str1 and str2 to exist already.
str1='Fire in NY.';
str2='50 died in NY fire.';

%#.........................................................................
%# Calculating the LD between two strings of events.
%#.........................................................................
L1=length(str1)+1;
L2=length(str2)+1;
L=zeros(L1,L2);   %# DP table: L(i,j) = LD between the first i-1 chars of
                  %# str1 and the first j-1 chars of str2.

g=+1;             %# cost of an insertion or deletion (gap)
m=+0;             %# cost of a match; cheapest, since we minimize
d=+1;             %# cost of a substitution (mismatch)

%# Boundary conditions: turning a prefix into the empty string costs one
%# gap per character.
L(:,1)=((0:L1-1)*g)';
L(1,:)=(0:L2-1)*g;

%# Calculating required edits.
for idx=2:L1
    for idy=2:L2
        if(str1(idx-1)==str2(idy-1))
            score=m;
        else
            score=d;
        end
        m1=L(idx-1,idy-1) + score;   %# match or substitution
        m2=L(idx-1,idy) + g;         %# deletion from str1
        m3=L(idx,idy-1) + g;         %# insertion into str1
        L(idx,idy)=min(m1,min(m2,m3)); %# keep the cheapest edit path
    end
end
%# The LD between two strings.
D=L(L1,L2);

%#....................................................................
%# Making decision on what to do with the new event (string 2).
%#...................................................................
if (D<=4)     %# Distance is small enough that string 2 counts as identical
    store=str1;        %# to string 1. String 2 is ignored; string 1 stays.
elseif (D>=5 && D<=15) %# Too different to be identical, but not different
    %# enough to make string 2 an individual event.
    store= str2;       %# String 2 is somewhat similar to string 1.
                       %# So, string 1 is replaced with string 2 and stored.
else
    %# For all other distances, string 2 is stored along with string 1.
    store={str1; str2};
end
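
For comparison, one other purely syntactic measure is word-level overlap (the Jaccard index) instead of character edits. The sketch below is not part of the code above: the variable names and the rule for stripping punctuation are assumptions made for illustration, and like LD it knows nothing about meaning.

```matlab
%# Sketch only: word-level Jaccard similarity between two sentences.
s1 = 'Fire in NY.';
s2 = '50 died in NY fire.';

%# Lower-case, drop everything except letters, digits and whitespace,
%# then split on whitespace and keep each distinct word once.
w1 = unique(strsplit(strtrim(regexprep(lower(s1), '[^a-z0-9\s]', ''))));
w2 = unique(strsplit(strtrim(regexprep(lower(s2), '[^a-z0-9\s]', ''))));

%# Jaccard index = |shared words| / |all distinct words|, a value in [0, 1].
J = numel(intersect(w1, w2)) / numel(union(w1, w2));
```

Unlike the raw LD, this score does not grow with sentence length, so a single threshold can be compared across sentence pairs of different sizes.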

Any help is appreciated.

1 Answer:

Answer 0: (score: 2)

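"Semantics". There is no simple textbook algorithm for that. Natural language (English in particular) is a very complex and fickle beast. Let's look at the cases provided (and this is only a small part of what each one involves):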

INPUT1: Obama signs the law. A new law is signed by Obama.

Requires knowing that signing a law is what makes it a "new law".

INPUT2: A Bus is stopped here. A vehicle stops here.

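Requires knowing that a bus is a type of vehicle, plus some temporal relationship. Also, what if the bus is stopped here right now but doesn't normally stop here, or no longer stops here? It can be read in several ways.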

INPUT3: Fire in NY. NY is burnt down.

Requires knowing that fire can destroy things.

INPUT4: Fire in NY. 50 died in NY fire.

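Requires knowing that fire can kill things (see below). Requires connecting the "news headline" shorthand (50 WHAT?) to people. A brain does this almost trivially. A computer program is not a brain.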
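
To make the INPUT2 point concrete: even the crudest notion of "a bus is a kind of vehicle" has to come from a table that somebody supplies. A toy sketch, where the `isA` table and its three entries are invented purely for illustration and cover nothing beyond this one example:

```matlab
%# Hypothetical hand-made "is-a" table (word -> more general word).
%# A real system would need vastly more knowledge than this.
isA = containers.Map({'bus', 'car', 'truck'}, ...
                     {'vehicle', 'vehicle', 'vehicle'});

w1 = 'bus';
w2 = 'vehicle';

%# Treat two words as related if they are equal, or if the table lists
%# one of them as a kind of the other.
related = strcmp(w1, w2) ...
    || (isKey(isA, w1) && strcmp(isA(w1), w2)) ...
    || (isKey(isA, w2) && strcmp(isA(w2), w1));
```

This only covers the word-level relation. It says nothing about the temporal question ("is stopped" versus "stops"), or about what fire does to cities and people, which is where the real difficulty lies.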

I'm not an English major :-)