Question

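I have an array of 800 sentences. I want to remove all duplicates from the array (sentences that contain exactly the same words, but in a different order). So, for example, "this is a sentence" and "is this a sentence" are duplicates. Only one of them should remain in the array (it doesn't matter which one).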

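My first idea was to copy them one by one to a new array, checking each time whether the sentence already exists in the new array. I would do this by looping through all the elements of the new array and comparing the sentences with the code from: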

Using jQuery to compare two arrays of Javascript objects

However, this quickly becomes too computationally intensive and makes the JavaScript engine unresponsive.

Any ideas on how to make the algorithm more efficient would be appreciated.

Answer 1

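Use an Object as a lookup to get fast, hash-table-backed checks. That means using strings as your key type, which in turn means normalising the case, word order, etc. first so that each combination of words maps to a unique key.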

// Get key for sentence, removing punctuation and normalising case and word order
// eg 'Hello, a  horse!' -> 'x_a hello horse'
// the 'x_' prefix is to avoid clashes with any object properties with undesirable
// special behaviour (like prototype properties in IE) and get a plain lookup
//
function getSentenceKey(sentence) {
    var trimmed= sentence.replace(/^\s+/, '').replace(/\s+$/, '').toLowerCase();
    var words= trimmed.replace(/[^\w\s]+/g, '').replace(/\s+/g, ' ').split(' ');
    words.sort();
    return 'x_'+words.join(' ');
}

var lookup= {};
for (var i= sentences.length; i-->0;) {
    var key= getSentenceKey(sentences[i]);
    if (key in lookup)
        sentences.splice(i, 1);
    else
        lookup[key]= true;
}

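If you need to support non-ASCII characters this will need some work (\w doesn't play well with Unicode in JS, and what constitutes a word in some languages is a difficult question). Also, is "foo bar foo" the same sentence as "bar bar foo"?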
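As a rough illustration of the two open points above, here is a minimal sketch of a Unicode-aware key function. It assumes an engine that supports Unicode property escapes (`\p{L}`, `\p{N}` with the `u` flag, ES2018). The name `getUnicodeSentenceKey` and the `ignoreRepeats` flag are hypothetical and not part of the answer's code; the sketch also still treats whitespace as the only word separator, which will not hold for every language.

```javascript
// Hypothetical Unicode-aware variant of getSentenceKey (a sketch, not the answer's code).
// ignoreRepeats = true treats "foo bar foo" and "bar bar foo" as the same sentence.
function getUnicodeSentenceKey(sentence, ignoreRepeats) {
    var words = sentence
        .normalize('NFC')                       // unify composed/decomposed accents
        .toLowerCase()
        .replace(/[^\p{L}\p{N}\s]+/gu, '')      // keep letters, digits and whitespace only
        .split(/\s+/)
        .filter(Boolean);                       // drop empty strings from leading/trailing spaces
    if (ignoreRepeats) {
        words = Array.from(new Set(words));     // compare sets of words instead of multisets
    }
    words.sort();
    return 'x_' + words.join(' ');
}
```

Whether repeated words should count is a decision about what "duplicate" means for your data, so the flag is left to the caller.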

Answer 2

How about this?

var va = ["this is a sentence", "sentence this is", "sentence this is a"];
var vb = {}; // dictionary of combined sorted words in each sentence
var vc = []; // output list of sentences without duplicates

for (var i=0;i<va.length;i++){
    // split sentence, sort words, and recombine (this is a sentence => a is sentence this)
    var combined = va[i].split(" ").sort().join(" "); 

    if (!vb[combined]){       // if set of combined sorted words doesn't exist already
        vc.push(va[i]);      // sentence isn't duplicated, push to output list
        vb[combined] = true  // add set to dictionary
    }
}

alert(vc.join("\n"))
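One caveat with a plain object as the dictionary is that inherited keys such as `constructor` already look "present" to a truthiness check. The following is a minimal sketch of the same idea using a `Set` instead, assuming an ES2015-capable engine; the function name `dedupeSentences` is hypothetical.

```javascript
// Sketch: same split/sort/join key as above, but tracked in a Set
// so that words like "constructor" are not mistaken for seen keys.
function dedupeSentences(sentences) {
    var seen = new Set();   // keys of sentences already kept
    var output = [];        // original sentences, first occurrence only
    for (var i = 0; i < sentences.length; i++) {
        var combined = sentences[i].split(" ").sort().join(" ");
        if (!seen.has(combined)) {
            seen.add(combined);
            output.push(sentences[i]);
        }
    }
    return output;
}
```

Like the loop above, this keeps the first sentence of each group and preserves the input order.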

Answer 3

Here's something to try. I haven't tested its performance on large arrays, but I think it should be fine. No jQuery needed.

function removeDuplicates(array)
{
    var new_array = [];
    for(var i=0; i<array.length; i++){
        // convert current sentence to a lowercase string with its words sorted
        var sen = array[i].toLowerCase().split(" ").sort().join(" ");
        if(new_array.indexOf(sen) == -1){
            // no matches, let's add it!
            new_array.push(sen);
        }
    }
    return new_array;
}

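// Fallback for old browsers that lack a native Array.prototype.indexOf
if (!Array.prototype.indexOf) {
    Array.prototype.indexOf = function(item, optional_start_index)
    {
        for(var i=optional_start_index||0; i<this.length; i++){
            if(this[i] == item) return i;
        }
        return -1;
    };
}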

Use it like this:

var a = ["this is a name", "name is this a", "this name is a", "hello there"];
var clean_array = removeDuplicates(a);
alert(clean_array); // outputs: a is name this,hello there

Answer 4

JavaScript similar_text: calculate the similarity between two strings

Answer 5

Sort the array of sentences, then loop through it and remove an item if it is the same as the previous one:

texts.sort();
for(var i = 1; i < texts.length; i++){
    if(texts[i] === texts[i-1]){
        texts.splice(i,1);
        i--;
     }
}

I tested this on an array of 800 strings and it seemed reasonably fast.

EDIT: Sorry, I didn't read your question carefully enough.
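The sort-and-scan idea can probably still be adapted to the question if each sentence is first reduced to a key that ignores word order. The sketch below is one way to do that, not something the answer itself provides; `dedupeBySortedKey` is a hypothetical name, and lowercasing plus splitting on whitespace is an assumed normalisation.

```javascript
// Sketch: sort sentences by an order-independent key, then drop
// every sentence whose key equals that of the previous one.
function dedupeBySortedKey(texts) {
    var keyed = texts.map(function (text) {
        return {
            key: text.toLowerCase().split(/\s+/).sort().join(" "),
            text: text
        };
    });
    keyed.sort(function (a, b) {
        return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
    });
    var result = [];
    for (var i = 0; i < keyed.length; i++) {
        if (i === 0 || keyed[i].key !== keyed[i - 1].key) {
            result.push(keyed[i].text);
        }
    }
    return result;
}
```

Note that the result comes back ordered by key rather than in the original order, and the input array is left unmodified.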

Answer 6

Here is a very simple implementation that makes use of some jQuery.

Check the demo here ->

Source:

var arr = ["This is a sentence", "Is this a sentence", "potatoes"];
var newArr = [];
var sortedArr = [];
$.each(arr, function(i) {
    // the default sort() already orders the words alphabetically
    var temp = this.toLowerCase().split(" ").sort().join(' ');
    if ($.inArray(temp, sortedArr) == -1) {
        sortedArr.push(temp);
        newArr.push(arr[i]);   
    }
});

//output
$.each(newArr, function() {
    document.write(this + '<br />');
});

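It uses three arrays: the source, a collection of sorted sentences to match against, and the output array. Matching is done by converting the sentence to lowercase, splitting it on spaces, sorting the words alphabetically, and then rebuilding the sentence string. If that particular combination has been seen before, it is not added to the result. If it has not, it is added.

The loop at the end just outputs the resulting array.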

Removing duplicate strings using javascript

6 answers: