Comparing strings whose words have changed position

Date: 2016-06-16 10:18:12

Tags: c# string-comparison

I have to compare strings, and my problem is that many of the words appear in different positions.

For example:

  

Economie - Un oeil sur les medias

  

Un oeil sur les medias - Economie

Is there an algorithm that can detect such duplicates by computing the percentage of individual words that match?

4 answers:

Answer 0 (score: 0)

One approach is to split the strings on whitespace and compute the intersection of the resulting sets of words:

static double IntersectionSize(string a, string b) {
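    // A null separator array splits on whitespace. Dropping empty entries makes
    // an empty or blank input produce an empty array, so the guard below works.
    var wordsA = a.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    var wordsB = b.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);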
    if (wordsA.Length == 0 || wordsB.Length == 0) {
        // Avoid division by zero on return
        return 0;
    }
    var common = wordsA.Intersect(wordsB);
    double res = common.Sum(w => w.Length); // Total length of common words
    return 2 * res / (wordsA.Distinct().Sum(w => w.Length) + wordsB.Distinct().Sum(w => w.Length));
}

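This returns the total length of the common words as a fraction of the average total length of the distinct words in the two strings.

Note that the algorithm above does not care how many times a word occurs in a string. For example, "a a a" and "a" would be reported as a 100% match.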


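Answer 1 (score: 0)

Longest common subsequence is usually computed character by character, but it can also operate at the word level if you split the strings into words and perhaps apply some smart stemming.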
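
A minimal sketch of that idea in Java, assuming plain whitespace splitting and no stemming. The class name WordLcsDemo, the helper names, and the normalization (LCS length divided by the longer word count) are illustrative choices, not something this answer specifies.

```java
public class WordLcsDemo {

    // Splits on runs of whitespace; a blank string yields no words.
    public static String[] words(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Length of the longest common subsequence of two word arrays,
    // computed with the classic dynamic-programming table.
    public static int lcsLength(String[] a, String[] b) {
        int[][] table = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1].equals(b[j - 1])) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                } else {
                    table[i][j] = Math.max(table[i - 1][j], table[i][j - 1]);
                }
            }
        }
        return table[a.length][b.length];
    }

    // Assumed normalization: LCS length relative to the longer document.
    public static double similarity(String lhs, String rhs) {
        String[] a = words(lhs);
        String[] b = words(rhs);
        int longer = Math.max(a.length, b.length);
        return longer == 0 ? 0.0 : (double) lcsLength(a, b) / longer;
    }

    public static void main(String[] args) {
        System.out.println(similarity(
            "Economie - Un oeil sur les medias",
            "Un oeil sur les medias - Economie"));
    }
}
```

Because a subsequence preserves order, the two titles from the question share only the five-word run "Un oeil sur les medias" out of seven tokens, so this measure gives 5/7 rather than 100%. That may or may not be what you want when reordered words should count as a full match.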

Answer 2 (score: 0)

Try this. Hope it helps.

    string s1 = "word1 - word2";
    string s2 = "word2 - word1";

    var s1words = new HashSet<string>(s1.Split(' ').Distinct());
    var s2words = new HashSet<string>(s2.Split(' ').Distinct());

    // number of distinct s1 words that also occur in s2
    var s1INs2 = s1words.Count(x => s2words.Contains(x));

    // number of distinct s2 words that also occur in s1
    var s2INs1 = s2words.Count(x => s1words.Contains(x));
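
The snippet above stops at the two counts. One possible way to turn such a count into the percentage the question asks for is sketched below in Java. The class name WordOverlapDemo and the method percentContained are hypothetical, and dividing by the number of distinct words in the first string is an assumption about what "percentage" should mean here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class WordOverlapDemo {

    // Percentage of the distinct words in `from` that also occur in `in`.
    public static double percentContained(String from, String in) {
        Set<String> fromWords = new HashSet<>(Arrays.asList(from.split(" ")));
        Set<String> inWords = new HashSet<>(Arrays.asList(in.split(" ")));
        if (fromWords.isEmpty()) {
            return 0.0; // nothing to compare, avoid dividing by zero
        }
        long shared = fromWords.stream().filter(inWords::contains).count();
        return 100.0 * shared / fromWords.size();
    }

    public static void main(String[] args) {
        System.out.println(percentContained("word1 - word2", "word2 - word1"));
        System.out.println(percentContained("a b c d", "a b"));
    }
}
```

The measure is asymmetric: "a b" is fully contained in "a b c d", but only half of "a b c d" is contained in "a b". Computing it in both directions, as the answer does with its two counts, shows whether one string is a subset of the other.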

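Answer 3 (score: 0)

The Jaccard index is a natural similarity measure for this use case. Here is an implementation that takes term frequency into account, so that the document "a a a a" is not treated as identical to "a":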

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;

public class JaccardWordWiseSimilarityDemo {

    protected Map<String, Integer> termFrequencies(final String document) {
        if (document.isEmpty()) {
            return new HashMap<>(0);
        }

        final Map<String, Integer> termFrequencies = new HashMap<>();
        final Scanner terms = new Scanner(document);

        while (terms.hasNext()) {
            final String term = terms.next();
            final int termFrequency = termFrequencies.containsKey(term)
                ? 1 + termFrequencies.get(term)
                : 1;
            termFrequencies.put(term, termFrequency);
        }

        return termFrequencies;
    }

    protected int intersectionSize(
            final Map<String, Integer> lhsTermFrequencies,
            final Map<String, Integer> rhsTermFrequencies) {
        final Set<String> lhsTerms = lhsTermFrequencies.keySet();
        final Set<String> rhsTerms = rhsTermFrequencies.keySet();
        final Set<String> intersectionTerms = new HashSet<>(lhsTerms);
        intersectionTerms.retainAll(rhsTerms);
        int intersectionSize = 0;
        for (final String pair : intersectionTerms) {
            intersectionSize += Math.min(
                lhsTermFrequencies.get(pair),
                rhsTermFrequencies.get(pair));
        }
        return intersectionSize;
    }

    protected int unionSize(
            final Map<String, Integer> lhsTermFrequencies,
            final Map<String, Integer> rhsTermFrequencies) {
        final Set<String> lhsTerms = lhsTermFrequencies.keySet();
        final Set<String> rhsTerms = rhsTermFrequencies.keySet();
        final Set<String> unionTerms = new HashSet<>(lhsTerms);
        unionTerms.addAll(rhsTerms);
        int unionSize = 0;
        for (final String term : unionTerms) {
            if (lhsTermFrequencies.containsKey(term)
                    && rhsTermFrequencies.containsKey(term)) {
                unionSize += Math.max(
                    lhsTermFrequencies.get(term),
                    rhsTermFrequencies.get(term));
            }
            else if (lhsTermFrequencies.containsKey(term)) {
                unionSize += lhsTermFrequencies.get(term);
            }
            else {
                unionSize += rhsTermFrequencies.get(term);
            }
        }
        return unionSize;
    }

    protected double between(final String lhsDocument, final String rhsDocument) {
        if (lhsDocument.equals(rhsDocument)) {
            return 1.0;
        }
        if (lhsDocument.isEmpty() || rhsDocument.isEmpty()) {
            return 0.0;
        }
        final Map<String, Integer> lhsTermFrequencies = termFrequencies(lhsDocument);
        final Map<String, Integer> rhsTermFrequencies = termFrequencies(rhsDocument);
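        final int union = unionSize(lhsTermFrequencies, rhsTermFrequencies);
        if (union == 0) {
            // Both documents contain only whitespace: avoid 0/0 (NaN).
            return 0.0;
        }
        return (double) intersectionSize(lhsTermFrequencies, rhsTermFrequencies)
             / (double) union;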
    }

    public static void main(final String... args) {
        final JaccardWordWiseSimilarityDemo similarity =
            new JaccardWordWiseSimilarityDemo();
        for (int lhsIndex = 0; lhsIndex < args.length; lhsIndex += 1) {
            final String lhsDocument = args[lhsIndex];
            for (int rhsIndex = 0; rhsIndex < args.length; rhsIndex += 1) {
                if (lhsIndex != rhsIndex) {
                    final String rhsDocument = args[rhsIndex];
                    System.out.printf("similarity(\"%s\", \"%s\") = %.7f %%%n",
                        lhsDocument, rhsDocument,
                        100.0 * similarity.between(lhsDocument, rhsDocument));
                }
            }
        }
    }
}

Sample run:

% java JaccardWordWiseSimilarityDemo "Economie - Un oeil sur les medias" "Un oeil sur les medias - Economie"
similarity("Economie - Un oeil sur les medias", "Un oeil sur les medias - Economie") = 100.0000000 %
similarity("Un oeil sur les medias - Economie", "Economie - Un oeil sur les medias") = 100.0000000 %

% java JaccardWordWiseSimilarityDemo "a a a a" "a"
similarity("a a a a", "a") = 25.0000000 %
similarity("a", "a a a a") = 25.0000000 %

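If the case of the terms does not matter, you should lowercase each document before computing their similarity.

Note that this example does not tokenize punctuation properly, because it simply splits on whitespace. If you need punctuation-aware tokenization, I suggest looking at the Stanford CoreNLP tokenizer.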
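
For simple inputs, a lightweight stand-in for both steps might look like the following sketch. It lowercases the document and splits on any run of characters that are neither letters nor digits. The class name SimpleTokenizerDemo and the regular expression are assumptions made for illustration; this is a crude approximation, not a replacement for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SimpleTokenizerDemo {

    // Lowercases the document, then splits on runs of characters that are
    // neither Unicode letters nor digits, discarding empty tokens.
    public static List<String> tokenize(String document) {
        List<String> tokens = new ArrayList<>();
        String lowered = document.toLowerCase(Locale.ROOT);
        for (String token : lowered.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Economie - Un oeil sur les medias!"));
    }
}
```

With this preprocessing the standalone "-" disappears and "Medias!" becomes the same term as "medias". It also splits contractions such as "l'oeil" into two tokens, which is one reason a proper tokenizer is preferable for real text.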