I have to compare strings, and my problem is that many of the words appear in different positions.
For example:
Economie - Un oeil sur les medias
or
Un oeil sur les medias - Economie
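Is there an algorithm that can detect such duplicates by computing the percentage of individual words that match?
Answer 0 (score: 0)
One approach is to split the strings on whitespace and compute the intersection of the resulting word sets: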
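// Requires: using System; using System.Linq;
static double IntersectionSize(string a, string b) {
    // A null separator splits on whitespace. RemoveEmptyEntries drops empty tokens;
    // without it an empty input yields one empty word and the check below never fires.
    var wordsA = a.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    var wordsB = b.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    if (wordsA.Length == 0 || wordsB.Length == 0) {
        // Avoid division by zero on return
        return 0;
    }
    var common = wordsA.Intersect(wordsB);
    double res = common.Sum(w => w.Length); // Total length of common words
    return 2 * res / (wordsA.Distinct().Sum(w => w.Length) + wordsB.Distinct().Sum(w => w.Length));
}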
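This yields the total length of the common words as a fraction of the average total length of the distinct words in the two strings.
Note that the algorithm above ignores how many times a word occurs in a string. For example, "a a a" and "a" are reported as a 100% match.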
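For readers working in Java, here is a rough sketch of the same length-weighted overlap. The class and method names (`WordOverlap`, `words`, `totalLength`, `score`) are made up for illustration, and splitting on `\s+` is an assumption that mirrors the whitespace split above.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class WordOverlap {
    // Distinct whitespace-separated words; empty tokens are dropped.
    static Set<String> words(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Sum of the lengths of all words in the set.
    static int totalLength(Set<String> words) {
        int total = 0;
        for (String word : words) {
            total += word.length();
        }
        return total;
    }

    // Length of the shared words divided by the average length of both word sets.
    static double score(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        if (wordsA.isEmpty() || wordsB.isEmpty()) {
            return 0.0; // nothing to compare, and avoids dividing by zero
        }
        Set<String> common = new LinkedHashSet<>(wordsA);
        common.retainAll(wordsB);
        return 2.0 * totalLength(common) / (totalLength(wordsA) + totalLength(wordsB));
    }

    public static void main(String[] args) {
        System.out.println(score("Economie - Un oeil sur les medias",
                                 "Un oeil sur les medias - Economie"));
        System.out.println(score("Economie - Sport", "Economie - Culture"));
    }
}
```

The two strings from the question share every word, so they score 1.0. For "Economie - Sport" versus "Economie - Culture" the shared words have total length 9 and the two sets have lengths 14 and 16, which gives 2 * 9 / 30 = 0.6.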
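Answer 1 (score: 0)
Longest common subsequence is usually computed per character, but if you split the strings into words, and perhaps apply some smart stemming, it can also work at the word level.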
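A minimal sketch of a word-level longest common subsequence in Java could look like the following. The names `WordLcs`, `tokenize`, `lcsLength` and `similarity` are hypothetical, no stemming is applied, and dividing by the longer word count is just one possible way to turn the length into a ratio.

```java
final class WordLcs {
    // Splits on whitespace; an empty or blank string yields no words.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Classic dynamic-programming LCS, comparing whole words instead of characters.
    static int lcsLength(String[] a, String[] b) {
        int[][] dp = new int[a.length + 1][b.length + 1];
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                if (a[i - 1].equals(b[j - 1])) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.length][b.length];
    }

    // Assumption: normalize by the longer of the two word counts.
    static double similarity(String a, String b) {
        String[] wordsA = tokenize(a);
        String[] wordsB = tokenize(b);
        int longest = Math.max(wordsA.length, wordsB.length);
        return longest == 0 ? 0.0 : (double) lcsLength(wordsA, wordsB) / longest;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Economie - Un oeil sur les medias",
                                      "Un oeil sur les medias - Economie"));
    }
}
```

Keep in mind that a subsequence is order-sensitive. For the two strings in the question the longest common word sequence is "Un oeil sur les medias" (5 of 7 words), so reordered duplicates score below 100% with this measure.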
Answer 2 (score: 0)
Try this. Hope it helps.
string s1 = "word1 - word2";
string s2 = "word2 - word1";
// HashSet already discards duplicates, so no Distinct() call is needed
var s1words = new HashSet<string>(s1.Split(' '));
var s2words = new HashSet<string>(s2.Split(' '));
// Number of distinct s1 words that also occur in s2
var s1INs2 = s1words.Count(x => s2words.Contains(x));
// Number of distinct s2 words that also occur in s1
var s2INs1 = s2words.Count(x => s1words.Contains(x));
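The snippet above stops at the raw counts. One way to get a percentage out of them, sketched here in Java under the assumption that you want "how much of one string's vocabulary appears in the other", is to divide the shared count by the size of the first set. `WordCoverage`, `wordSet` and `coverage` are made-up names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class WordCoverage {
    // Distinct words of the text, split on single spaces as in the snippet above.
    static Set<String> wordSet(String text) {
        Set<String> words = new HashSet<>(Arrays.asList(text.split(" ")));
        words.remove(""); // consecutive spaces produce empty tokens
        return words;
    }

    // Percentage of the distinct words of source that also occur in target.
    static double coverage(String source, String target) {
        Set<String> sourceWords = wordSet(source);
        Set<String> targetWords = wordSet(target);
        if (sourceWords.isEmpty()) {
            return 0.0;
        }
        long shared = sourceWords.stream().filter(targetWords::contains).count();
        return 100.0 * shared / sourceWords.size();
    }

    public static void main(String[] args) {
        System.out.println(coverage("word1 - word2", "word2 - word1"));
        System.out.println(coverage("a b c d", "a b"));
        System.out.println(coverage("a b", "a b c d"));
    }
}
```

The shared count is the same in both directions, but the percentage is not: "a b c d" is 50% covered by "a b", while "a b" is 100% covered by "a b c d".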
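Answer 3 (score: 0)
The Jaccard index is a natural similarity measure for this use case. Here is an implementation that takes term frequencies into account, so the document "a a a a" is not treated as identical to "a":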
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Scanner;
import java.util.Set;
public class JaccardWordWiseSimilarityDemo {
    // Counts how often each whitespace-separated term occurs in the document.
    protected Map<String, Integer> termFrequencies(final String document) {
        final Map<String, Integer> termFrequencies = new HashMap<>();
        try (Scanner terms = new Scanner(document)) {
            while (terms.hasNext()) {
                termFrequencies.merge(terms.next(), 1, Integer::sum);
            }
        }
        return termFrequencies;
    }
    // Multiset intersection size: each shared term counts with its smaller frequency.
    protected int intersectionSize(
            final Map<String, Integer> lhsTermFrequencies,
            final Map<String, Integer> rhsTermFrequencies) {
        final Set<String> lhsTerms = lhsTermFrequencies.keySet();
        final Set<String> rhsTerms = rhsTermFrequencies.keySet();
        final Set<String> intersectionTerms = new HashSet<>(lhsTerms);
        intersectionTerms.retainAll(rhsTerms);
        int intersectionSize = 0;
        for (final String term : intersectionTerms) {
            intersectionSize += Math.min(
                    lhsTermFrequencies.get(term),
                    rhsTermFrequencies.get(term));
        }
        return intersectionSize;
    }
    // Multiset union size: each term counts with its larger frequency.
    protected int unionSize(
            final Map<String, Integer> lhsTermFrequencies,
            final Map<String, Integer> rhsTermFrequencies) {
        final Set<String> unionTerms = new HashSet<>(lhsTermFrequencies.keySet());
        unionTerms.addAll(rhsTermFrequencies.keySet());
        int unionSize = 0;
        for (final String term : unionTerms) {
            // A term missing from one side contributes a frequency of 0 there.
            unionSize += Math.max(
                    lhsTermFrequencies.getOrDefault(term, 0),
                    rhsTermFrequencies.getOrDefault(term, 0));
        }
        return unionSize;
    }
    // Jaccard similarity of the two documents, in the range [0, 1].
    protected double between(final String lhsDocument, final String rhsDocument) {
        if (lhsDocument.equals(rhsDocument)) {
            return 1.0;
        }
        final Map<String, Integer> lhsTermFrequencies = termFrequencies(lhsDocument);
        final Map<String, Integer> rhsTermFrequencies = termFrequencies(rhsDocument);
        // Check the parsed terms rather than the raw strings: a whitespace-only
        // document has no terms and would otherwise lead to 0/0 (NaN).
        if (lhsTermFrequencies.isEmpty() || rhsTermFrequencies.isEmpty()) {
            return 0.0;
        }
        return (double) intersectionSize(lhsTermFrequencies, rhsTermFrequencies)
                / (double) unionSize(lhsTermFrequencies, rhsTermFrequencies);
    }
    public static void main(final String... args) {
        final JaccardWordWiseSimilarityDemo similarity =
                new JaccardWordWiseSimilarityDemo();
        for (int lhsIndex = 0; lhsIndex < args.length; lhsIndex += 1) {
            final String lhsDocument = args[lhsIndex];
            for (int rhsIndex = 0; rhsIndex < args.length; rhsIndex += 1) {
                if (lhsIndex != rhsIndex) {
                    final String rhsDocument = args[rhsIndex];
                    System.out.printf("similarity(\"%s\", \"%s\") = %.7f %%%n",
                            lhsDocument, rhsDocument,
                            100.0 * similarity.between(lhsDocument, rhsDocument));
                }
            }
        }
    }
}
Sample run:
% java JaccardWordWiseSimilarityDemo "Economie - Un oeil sur les medias" "Un oeil sur les medias - Economie"
similarity("Economie - Un oeil sur les medias", "Un oeil sur les medias - Economie") = 100.0000000 %
similarity("Un oeil sur les medias - Economie", "Economie - Un oeil sur les medias") = 100.0000000 %
% java JaccardWordWiseSimilarityDemo "a a a a" "a"
similarity("a a a a", "a") = 25.0000000 %
similarity("a", "a a a a") = 25.0000000 %
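If letter case is not significant, lowercase each document before computing their similarity.
Note that this example does not tokenize punctuation properly, because it only splits on whitespace. If you need punctuation-aware tokenization, I recommend looking at the Stanford CoreNLP tokenizer.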
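If pulling in CoreNLP is more than you need, a crude stand-in is to lowercase the text and split on anything that is not a letter or digit. The sketch below is only that, a regex approximation and not the CoreNLP tokenizer. `SimpleTokenizer` and `tokens` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SimpleTokenizer {
    // Lowercases the document and splits on runs of non-letter, non-digit characters.
    static List<String> tokens(String document) {
        List<String> result = new ArrayList<>();
        String lowered = document.toLowerCase(Locale.ROOT);
        for (String token : lowered.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // a leading separator yields one empty token
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Economie - Un oeil, sur les MEDIAS!"));
    }
}
```

With this tokenizer the standalone "-" in the example titles is dropped, so "Economie - Un oeil, sur les MEDIAS!" becomes economie, un, oeil, sur, les, medias. Joining the tokens with spaces gives a normalized document that can be passed to the similarity code above.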