How to extract the common words from a set of strings

Time: 2015-12-23 02:31:30

Tags: string algorithm

Suppose we have strings like these:

Tommy is a very good child
Tommy has a very wonderful child
Tommy loves his very child

I want to extract the common words of the 3 strings above as:

Tommy very child

How can I do this? Thanks.

3 Answers:

Answer 0: (score: 2)

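For simplicity, I am using lodash here:

var a = 'Hello world'.split(' ');
var b = 'Hello again world!'.split(' ');
var c = 'Hello tomorrow'.split(' ');

// _.intersection keeps only the values present in every array
var commonWords = _.intersection(a, b, c);
// => ['Hello']

I use lodash only because it provides a succinct method for what you are actually trying to do, which is an intersection, parameterized by (for example) a delimiter and a transform.

Intersection is language-agnostic: the algorithm used to implement it will vary with the language you choose.

You can wrap this in a function that takes the delimiter (for example, do I split on spaces?) and the transform (for example, must words have the same capitalization to match?).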
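A minimal sketch of such a function in plain JavaScript, without lodash. The name `commonWords` and its `delimiter` and `transform` options are hypothetical, chosen only for illustration, and the sketch assumes that "common" means "present in every string".

```javascript
// Hypothetical helper: returns the words that occur in every input string.
// `delimiter` controls how each string is split; `transform` normalizes
// each word before comparison (identity by default).
function commonWords(strings, options = {}) {
    const delimiter = options.delimiter === undefined ? ' ' : options.delimiter;
    const transform = options.transform || (word => word);

    if (strings.length === 0) return [];

    // One set of normalized words per string.
    const wordSets = strings.map(
        s => new Set(s.split(delimiter).map(transform))
    );

    // Keep the words of the first string that appear in all the other sets.
    // Iterating a Set preserves insertion order, so the result follows the
    // word order of the first string.
    return [...wordSets[0]].filter(
        word => wordSets.every(set => set.has(word))
    );
}

const greetings = ['Hello world', 'Hello again world!', 'Hello tomorrow'];
console.log(commonWords(greetings));
// => [ 'Hello' ]

// Case-insensitive matching via a transform.
console.log(commonWords(['Hello World', 'hello there'], {
    transform: word => word.toLowerCase()
}));
// => [ 'hello' ]
```

Building one `Set` per string keeps each membership test cheap, and passing the transform as a function leaves the matching rule (case, punctuation, stemming) up to the caller.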

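Answer 1: (score: 2)

You can use a data structure called an inverted index.

First, assign a unique integer to each input string. The idea is then that, for every word in the input strings, you compute a list of integers identifying the strings in which that word occurs. Note that this can be done in a single pass over the input strings. In your case, to find the words that occur in all strings, output the words whose occurrence list has as many entries as there are input strings (provided each string is recorded at most once per word).

See here for details: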

https://en.wikipedia.org/wiki/Inverted_index
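
The inverted-index idea can be sketched in JavaScript roughly as follows. The function names `buildInvertedIndex` and `wordsInAllStrings` are hypothetical, the id of a string is assumed to be its position in the input array, and words are assumed to be separated by single spaces.

```javascript
// Build an inverted index: word -> Set of ids of the strings containing it.
// The id of a string is simply its position in the input array.
function buildInvertedIndex(strings) {
    const index = new Map();
    strings.forEach((str, id) => {
        for (const word of str.split(' ')) {
            if (!index.has(word)) index.set(word, new Set());
            index.get(word).add(id); // a Set records each id at most once
        }
    });
    return index;
}

// A word occurs in all strings when its id set is as large as the input.
function wordsInAllStrings(strings) {
    const index = buildInvertedIndex(strings);
    const result = [];
    for (const [word, ids] of index) {
        if (ids.size === strings.length) result.push(word);
    }
    return result;
}

const inputStrings = [
    'Tommy is a very good child',
    'Tommy has a very wonderful child',
    'Tommy loves his very child'
];
console.log(wordsInAllStrings(inputStrings));
// => [ 'Tommy', 'very', 'child' ]
```

Storing the ids in a `Set` rather than a list means a word repeated within one string is still counted once, which is what makes the size comparison valid.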

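Answer 2: (score: 1)

EDIT: Prompted by @Joce's comment, I just realized that I wrote my answer in JavaScript. It can easily be adapted to other languages, though. If you are not using JavaScript, treat it as pseudocode.

EDIT 2: Wow! It worked on my first try! See the working example on JSFiddle.net.

This may be a rather script-heavy answer, but here goes:

Store the original sentences as an array of strings: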

var sentences = [
    "Tommy is a very good child",
    "Tommy has a very wonderful child",
    "Tommy loves his very child"
];

You can then create an array of words from each sentence and store these arrays in a multidimensional array.

var split = [];
for(var i = 0; i < sentences.length; i++) {
    split[i] = sentences[i].split(" ");
}

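You could also remove duplicate words at this point. I don't know how to do that off the top of my head, but there is probably a simple algorithm for it. Unless, of course, you want to allow sentences with repeated words.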
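
One possible way to drop the repeated words is the standard `Set` object (ES2015 and later). This is only a sketch: `splitUnique` is a hypothetical name, the sample sentences are made up to contain a repeated word, and `sentences` and `split` are redeclared here so that the snippet runs on its own.

```javascript
// Illustrative input with a repeated word ("very") in the first sentence.
var sentences = [
    "Tommy is a very very good child",
    "Tommy has a very wonderful child"
];

var split = [];
for (var i = 0; i < sentences.length; i++) {
    split[i] = sentences[i].split(" ");
}

// A Set keeps only the first occurrence of each value, in insertion order;
// spreading it back into an array yields the words without duplicates.
var splitUnique = split.map(function (words) {
    return [...new Set(words)];
});

console.log(splitUnique[0]);
// => [ 'Tommy', 'is', 'a', 'very', 'good', 'child' ]
```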

Then you can create another array to hold the words that appear in every sentence, and fill it as follows:

var same = [];
for(var i = 0; i < split.length; i++) {             // loop through sentences
    for(var j = 0; j < split[i].length; j++) {      // go through each sentence for new words
        if(same.indexOf(split[i][j]) === -1) {      // if not already found
            var inAll = true;
            for(var k = 0; k < split.length; k++) { // check if in every sentence
                if(k === i) continue;
                if(split[k].indexOf(split[i][j]) === -1) { // if not found, make `inAll` false
                    inAll = false;
                    break;                          // no need to check the remaining sentences
                }
            }
            if(inAll) same.push(split[i][j]);       // if found in all other sentences, add to array `same`
        }
    }
}

Sorry for such a convoluted answer, but it should show the logic behind the algorithm. If you like, try changing the strings on JSFiddle.