Question

Suppose we have strings like these:

Tommy is a very good child
Tommy has a very wonderful child
Tommy loves his very child

I want to extract the words that are common to all 3 strings above.

How can I do that? Thanks.

Answer 1

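For simplicity, I'm using lodash here:

var a = 'Hello world'.split(' ');
var b = 'Hello again world!'.split(' ');
var c = 'Hello tomorrow'.split(' ');

// _.intersection keeps only the values present in every array
// (_.union would return every distinct word instead)
var commonWords = _.intersection(a, b, c);
// => ['Hello']

I'm using lodash only because it offers a succinct way to do what you're actually trying to do, which is an intersection, based on (for example) a delimiter and a transform.

Intersection is language-agnostic as a concept: the algorithm used to implement it will vary with the language you choose.

You can wrap this in a function that lets you define the delimiter (e.g., do I split on spaces?) and the transform (e.g., do words have to match in case to count as equal?).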
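The idea of a function with a configurable delimiter and transform could be sketched roughly as follows in plain JavaScript, without lodash. The name `commonWords` and the `delimiter` and `transform` parameters are hypothetical, chosen for illustration only; the sketch also assumes ES2015 features (`Set`, arrow functions, default parameters).

```javascript
// Hypothetical helper: returns the words that occur in every input string.
// `delimiter` and `transform` are illustrative parameters, not a lodash API.
function commonWords(strings, delimiter = ' ', transform = (word) => word) {
    if (strings.length === 0) return [];

    // Split each string, drop empty tokens, normalise each word,
    // and keep one Set of words per string.
    const wordSets = strings.map(
        (s) => new Set(
            s.split(delimiter).filter(Boolean).map((w) => transform(w))
        )
    );

    // Keep the words of the first string that appear in every set.
    return [...wordSets[0]].filter((word) =>
        wordSets.every((set) => set.has(word))
    );
}

const exampleSentences = [
    'Tommy is a very good child',
    'Tommy has a very wonderful child',
    'Tommy loves his very child'
];

console.log(commonWords(exampleSentences));
// logs the words shared by all three sentences: Tommy, very, child
```

Passing a transform such as `(w) => w.toLowerCase()` as the third argument would make the comparison case-insensitive.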

Answer 2

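You can use a data structure called an inverted index.

First, assign a unique integer to each input string. The idea is then that, for each word occurring in the input strings, you compute a list of integers identifying the strings in which that word occurs. Note that this can be done easily in a single pass over all the input strings. In your case, to find the words that occur in every string, output the words whose occurrence list has as many entries as there are input strings.

For details, see here: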

https://en.wikipedia.org/wiki/Inverted_index
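A minimal sketch of the inverted-index approach, assuming words are separated by single spaces and that each string's ID is simply its position in the input array. The function names `buildInvertedIndex` and `wordsInAllStrings` are hypothetical. A `Set` is used for each occurrence list so that a word repeated within one string is counted only once.

```javascript
// Hypothetical helper: maps each word to the Set of IDs (array positions)
// of the strings that contain it.
function buildInvertedIndex(strings) {
    const index = new Map();
    strings.forEach((text, id) => {
        for (const word of text.split(' ')) {
            if (word === '') continue; // skip empty tokens from repeated spaces
            if (!index.has(word)) index.set(word, new Set());
            index.get(word).add(id); // a Set records each string ID only once
        }
    });
    return index;
}

// A word occurs in all strings when its ID set has one entry per string.
function wordsInAllStrings(strings) {
    const index = buildInvertedIndex(strings);
    const result = [];
    for (const [word, ids] of index) {
        if (ids.size === strings.length) result.push(word);
    }
    return result;
}

const docs = [
    'Tommy is a very good child',
    'Tommy has a very wonderful child',
    'Tommy loves his very child'
];

console.log(wordsInAllStrings(docs));
// logs the words present in all three strings: Tommy, very, child
```

The index is built in one pass over the input, and the same index can then also answer other questions, such as which strings contain a given word.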

Answer 3

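Edit: I just realised, from @Joce's comment, that I wrote my answer in JavaScript. It can easily be adapted to other languages, though. If you're not using JavaScript, treat it as pseudocode.

Edit 2: Wow! It worked on my first try! See the working example on JSFiddle.net.

This may be a rather bulky, script-heavy answer, but here goes:

Put the original sentences in an array of strings: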

var sentences = [
    "Tommy is a very good child",
    "Tommy has a very wonderful child",
    "Tommy loves his very child"
];

You can then create an array of words from each sentence and store those arrays in a multidimensional array.

var split = [];
for(var i = 0; i < sentences.length; i++) {
    split[i] = sentences[i].split(" ");
}

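You could also remove duplicate words at this point. I don't know how to do that off the top of my head, but you can probably find a simple algorithm for it. Unless, of course, you want to allow repeated words.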
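One simple way to drop the duplicates, sketched here under the assumption that ES2015 `Set` is available, is to pass each word array through a `Set`, which keeps only the first occurrence of every value. The helper name `uniqueWords` is hypothetical.

```javascript
// Illustrative helper (hypothetical name): removes repeated words while
// preserving the order in which they first appear.
function uniqueWords(words) {
    return [...new Set(words)];
}

console.log(uniqueWords('very very good child'.split(' ')));
// logs: [ 'very', 'good', 'child' ]
```

It could be applied to each sentence's word array right after splitting.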

Then you can create another array to hold the words that appear in every sentence, and fill it as follows:

var same = [];
for (var i = 0; i < split.length; i++) {             // loop through sentences
    for (var j = 0; j < split[i].length; j++) {      // go through each sentence for new words
        var word = split[i][j];
        if (same.indexOf(word) === -1) {             // if not already found
            var inAll = true;
            for (var k = 0; k < split.length; k++) { // check if in every other sentence
                if (k === i) continue;
                if (split[k].indexOf(word) === -1) { // if not found, make `inAll` false
                    inAll = false;
                    break;                           // no need to check the remaining sentences
                }
            }
            if (inAll) same.push(word);              // if found in all other sentences, add to array `same`
        }
    }
}

Sorry for the convoluted answer, but it should show the logic behind the algorithm. If you like, try changing the strings on JSFiddle.

How to extract the common words from a bunch of strings

3 answers: