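String similarity algorithm returns wrong value

Asked: 2017-04-18 09:46:41

Tags: c++ algorithm

I wrote a short piece of code to compute the similarity of two strings as a percentage. It looks like this: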

int similarity(std::string s1, std::string s2) {
    int size = 0, sim = 0;
    if(s1==s2) {
        sim = 100;
    } else {
        if(s1 > s2)
            size = s2.size();
        else
            size = s1.size();

        for(int i = 0; i != (size); ++i) {
            if(s1[i] == s2[i])
                ++sim;
        }
    }
    return (sim/s2.size()>s1.size()?s2.size():s1.size())*10;
}

I test it in main (I appended the 'dddd' so that each string has exactly 10 characters):

std::cout << "Similarity of gananadddd and bananadddd (%): " << std::endl;
std::cout << similarity("gananadddd", "bananadddd") << "%" << std::endl;

And the console output:

Similarity of gananadddd and bananadddd (%):
100%

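So I think my code does not work properly, because:

  • The percentage should not be 100%, because one letter differs.
  • Some time ago I noticed that, for this input, the percentage becomes 1000% if I change return (sim/s2.size()>s1.size()?s2.size():s1.size())*10; to return (sim/s2.size()>s1.size()?s2.size():s1.size())*100;. But it really should be 100!

I would be glad if someone could point out where I made the mistake. I am also open to changing the algorithm.

Edit:

I modified the code slightly: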

double similarity(std::string s1, std::string s2) {
    int size = 0, sim = 0;
    if(!s1.compare(s2)) {
        return 100;
    } else {
        if(!s2.compare(s1) < 0)
            size = s2.size();
        else
            size = s1.size();

        for(int i = 0; i != (size); ++i) {
            if(s1[i] == s2[i])
                ++sim;
        }
    }
    return sim / ( (s2.size()>s1.size())?s2.size():s1.size() )*100;
}

...and now it returns 0%...
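
Both results are consistent with how C++ evaluates the two return expressions. In the first version, `/` and `>` bind tighter than `?:`, so the expression is read as `((sim / s2.size()) > s1.size()) ? s2.size() : s1.size()`. With sim = 9 and both sizes equal to 10 this is `0 > 10`, which selects `s1.size()`, and 10 * 10 gives 100 (or 1000 with `*100`). In the edited version, `sim / 10` is an integer division that yields 0 before the multiplication. In addition, `s1 > s2` compares the strings lexicographically rather than by length, and `!s2.compare(s1) < 0` compares a bool with 0, so it is always false. A minimal sketch of a position-by-position comparison that avoids these problems is shown below; the name `positional_similarity` is made up for illustration, and counting the extra characters of the longer string as mismatches is an assumption about the intended behavior.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: percentage of positions at which both strings hold the
// same character, relative to the length of the longer string.
double positional_similarity(const std::string& s1, const std::string& s2) {
    const std::size_t shorter = std::min(s1.size(), s2.size());
    const std::size_t longer = std::max(s1.size(), s2.size());
    if (longer == 0) return 100.0;  // assumption: two empty strings count as identical

    std::size_t same = 0;
    for (std::size_t i = 0; i < shorter; ++i) {
        if (s1[i] == s2[i]) ++same;
    }
    // Convert to double before dividing; same / longer would truncate to 0.
    return 100.0 * static_cast<double>(same) / static_cast<double>(longer);
}
```

With this sketch, `positional_similarity("gananadddd", "bananadddd")` evaluates to 90, since nine of the ten positions match.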

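2 answers:

Answer 0 (score: 1)

You may want to use the Levenshtein distance to measure how far apart the two strings are, and then derive the similarity from that distance and the lengths of the strings you compare.

Recursive implementation (based on a Java version):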

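#include <algorithm>
#include <string>

// Naive recursive Levenshtein distance between the first n characters of A
// and the first m characters of B. Exponential time: short strings only.
int LD(const std::string& A, const std::string& B, int n, int m) {
    if (n == 0) return m;
    if (m == 0) return n;

    // The parentheses around the conditional matter: without them, + and ==
    // are evaluated before ?: and the result is always 0 or 1.
    return std::min({
        LD(A, B, n - 1, m - 1) + (A[n - 1] == B[m - 1] ? 0 : 1),  // substitute or match
        LD(A, B, n, m - 1) + 1,                                   // insert
        LD(A, B, n - 1, m) + 1                                    // delete
    });
}

// Similarity in percent: 100 for identical strings, 0 when every character differs.
double similarity(const std::string& s1, const std::string& s2) {
    const std::size_t longest = std::max(s1.size(), s2.size());
    if (longest == 0) return 100.0;  // two empty strings are identical
    const int distance = LD(s1, s2, s1.size(), s2.size());
    return (1.0 - static_cast<double>(distance) / longest) * 100.0;
}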

The code above is C++, adapted from the recursive Java implementation.
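
A plain recursive Levenshtein computation solves the same subproblems many times, so its running time grows exponentially with the string lengths. A common alternative is the dynamic-programming formulation, sketched below under the assumption that only the distance (not the sequence of edits) is needed. The names `levenshtein` and `similarity_percent` are illustrative and not part of the answer's code.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance using two rolling rows of the DP table.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // "" -> b[0..j)

    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;  // a[0..i) -> ""
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j - 1] + cost,  // substitute or match
                               cur[j - 1] + 1,      // insert
                               prev[j] + 1});       // delete
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Assumed convention: 100 means identical, 0 means every character differs.
double similarity_percent(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 100.0;  // two empty strings are identical
    const double distance = static_cast<double>(levenshtein(a, b));
    return 100.0 * (1.0 - distance / static_cast<double>(longest));
}
```

For the strings from the question, the distance is 1 (one substitution), so `similarity_percent("gananadddd", "bananadddd")` comes out at about 90.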


Answer 1 (score: -2)

Use the function

std::string::compare()

For example:

if (!s.compare(t)) {
    // 's' and 't' are equal.
}

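compare() returns an int that is:

  • equal to zero if s and t are equal,
  • less than zero if s is less than t,
  • greater than zero if s is greater than t.

To elaborate on the use case: compare() is most useful when you care about how two strings are ordered relative to each other (less or greater), not only whether they are equal.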
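
As a small illustration of those return values, the sketch below reduces the result of `compare()` to its sign. The helper name `sign_of_compare` is invented for this example. Note that the ordering is lexicographic, so it says nothing about which string is longer; `size()` is the right tool for that.

```cpp
#include <string>

// Hypothetical helper: reduces the result of std::string::compare to -1, 0 or 1.
int sign_of_compare(const std::string& s, const std::string& t) {
    const int result = s.compare(t);
    if (result < 0) return -1;  // s orders before t
    if (result > 0) return 1;   // s orders after t
    return 0;                   // s and t are equal
}
```

For instance, `sign_of_compare("b", "aaaa")` is 1 even though "b" is the shorter string.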