I am using the Python ngram package. My code looks like this:
from ngram import NGram
print(NGram.compare("coat","hat",N=1))
The result is 0.4, which (I think!) is calculated as follows:
(a-b)/a
where:
a = total number of distinct ngrams across the two strings
b = number of ngrams NOT shared by the two strings
Plugging in the values from my code gives:
(5-3)/5 = 2/5 = 0.4
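To make the arithmetic above concrete, here is a minimal sketch of the formula as I have written it, using plain Python sets of single characters. The function name `distinct_ngram_similarity` is my own and is not part of the ngram package; it only assumes N=1 and no padding.

```python
def distinct_ngram_similarity(s1, s2):
    """Evaluate (a - b) / a over the distinct unigrams of two strings."""
    grams1, grams2 = set(s1), set(s2)
    a = len(grams1 | grams2)  # distinct unigrams across both strings
    b = len(grams1 ^ grams2)  # unigrams that appear in only one string
    return (a - b) / a


print(distinct_ngram_similarity("coat", "hat"))  # (5-3)/5 = 0.4
```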
If I change the code to:
from ngram import NGram
print(NGram.compare("coata","hata",N=1))
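The result is 0.5, and I am not sure how that answer follows from the formula I wrote above.
(6-3)/6
This does give 0.5, but are there really 6 distinct ngrams across these two strings?
Can anyone shed some light on this?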
Answer 0 (score: 1)
In the "hata", "coata" example, the grams look like this:
{'a': {'hata': 2, 'coata': 2},
'h': {'hata': 1},
'c': {'coata': 1},
't': {'hata': 1, 'coata': 1},
'o': {'coata': 1}}
So 'a' is counted twice, because it occurs twice in each string.
That makes 3 shared out of 6.
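The counting above can be reproduced with `collections.Counter`, treating each string as a multiset of characters. This is only a sketch that matches the numbers in this thread for N=1; it is not the library's own code, and the name `multiset_similarity` is made up for illustration.

```python
from collections import Counter


def multiset_similarity(s1, s2):
    """Shared unigram occurrences divided by total occurrences (N=1)."""
    c1, c2 = Counter(s1), Counter(s2)
    shared = sum((c1 & c2).values())  # per-gram minimum of the two counts
    total = sum((c1 | c2).values())   # per-gram maximum of the two counts
    return shared / total


print(multiset_similarity("coata", "hata"))  # 3/6 = 0.5
print(multiset_similarity("coat", "hat"))    # 2/5 = 0.4
```

With repeats counted this way, 'a' contributes 2 to both the shared count and the total, which is where the 6 comes from.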
Answer 1 (score: 0)
A more complex, more realistic example:
from ngram import NGram
S1 = "bronchopneumonia"
S2 = "pneumonia"
N = 1
print(NGram.compare(S1, S2, N=N))  # N is passed as a keyword argument
The result is 0.5625.
Formula:
(x-y)/x
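where:
x = total number of ngrams across the two strings, counting a repeated ngram as many times as it occurs in the string where it is most frequent
y = number of ngram occurrences NOT shared by the two strings
The calculation, showing the running totals of x and y: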
'b': {'bronchopneumonia': 1} x=1, y=1
'r': {'bronchopneumonia': 1} x=2, y=2
'o': {'bronchopneumonia': 3, 'pneumonia':1} x=5, y=4
'n': {'bronchopneumonia': 3, 'pneumonia':2} x=8, y=5
'c': {'bronchopneumonia': 1} x=9, y=6
'h': {'bronchopneumonia': 1} x=10, y=7
'p': {'bronchopneumonia': 1, 'pneumonia':1} x=11, y=7
'e': {'bronchopneumonia': 1, 'pneumonia':1} x=12, y=7
'u': {'bronchopneumonia': 1, 'pneumonia':1} x=13, y=7
'm': {'bronchopneumonia': 1, 'pneumonia':1} x=14, y=7
'i': {'bronchopneumonia': 1, 'pneumonia':1} x=15, y=7
'a': {'bronchopneumonia': 1, 'pneumonia':1} x=16, y=7
This gives us:
(16-7)/16 = 9/16 = 0.5625
The complication arises when you hit an ngram that occurs more than once in either string. In this example, both 'o' and 'n' occur multiple times.
'o' appears **3 times** in 'bronchopneumonia' and **1 time** in 'pneumonia'
x now moves to 5 (from 2) and y moves to 4 (from 2).
I think x is incremented by 3 because that is the higher of the two counts.
And I think y is incremented by 2 because that is the difference between the two counts (3-1=2).
The same applies to 'n':
'n' appears **3 times** in 'bronchopneumonia' and **2 times** in 'pneumonia'
x now moves to 8 (from 5) and y moves to 5 (from 4).
I think x is incremented by 3 because that is the higher of the two counts.
And I think y is incremented by 1 because that is the difference between the two counts (3-2=1).
This is my understanding of how it works. I am not a Python expert, so I may have misread what the source code does.
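For anyone who wants to check the running totals, here is a small sketch of the rule as I have described it: x grows by the higher of the two counts and y grows by the difference. It uses `collections.Counter` and does not call the ngram package; the function name `running_totals` is hypothetical, and the rule itself is my reading rather than something confirmed from the library source.

```python
from collections import Counter


def running_totals(s1, s2):
    """Accumulate x and y per unigram, printing the running totals."""
    c1, c2 = Counter(s1), Counter(s2)
    x = y = 0
    # dict.fromkeys keeps each character once, in first-seen order
    for gram in dict.fromkeys(s1 + s2):
        n1, n2 = c1[gram], c2[gram]
        x += max(n1, n2)   # higher of the two counts
        y += abs(n1 - n2)  # occurrences not matched in the other string
        print(f"{gram!r}: {n1} vs {n2}  x={x}, y={y}")
    return x, y


x, y = running_totals("bronchopneumonia", "pneumonia")
print((x - y) / x)  # (16-7)/16 = 0.5625
```

The final totals agree with the 0.5625 returned by NGram.compare for this pair.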