How do I identify substrings in the order they appear in a string?

Time: 2018-12-19 04:14:26

Tags: python

I have a list of sentences as follows.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

I also have a set of selected concepts.

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

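Now I want to pick out the concepts from selected_concepts that occur in each of the sentences, in the order in which they appear in that sentence.

That is, my output should look as follows.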

output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]

I can extract the concepts in each sentence as follows.

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        if item in sentence:
            sentence_tokens.append(item)
    output.append(sentence_tokens)

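However, I am having trouble arranging the extracted concepts by their order in the sentence. Is there a simple way to do this in Python?

5 answers:

Answer 0 (score: 1)

One way to do this is to use the .find() method to get the position of each substring, then sort by that value. For example: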

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)
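
The same idea can be written more compactly by passing sentence.find directly as the sort key. This is only a minimal sketch on made-up sample data: the function name concepts_in_order and the sample values are hypothetical, and it assumes that only the first occurrence of each concept matters.

```python
def concepts_in_order(sentence, concepts):
    """Return the concepts found in `sentence`, ordered by first occurrence."""
    found = [c for c in concepts if c in sentence]
    # str.find returns the index of the first occurrence, so it can serve as the sort key.
    return sorted(found, key=sentence.find)


# Hypothetical sample data, shortened from the question for illustration.
sample_sentence = 'data mining is the process of discovering patterns'
sample_concepts = ['patterns', 'process', 'data mining', 'methods']
print(concepts_in_order(sample_sentence, sample_concepts))
# → ['data mining', 'process', 'patterns']
```

Applied to the question's data, this would be called once per sentence, for example in a list comprehension over sentences.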

Answer 1 (score: 1)

You can use .find() together with .insert() instead, like this:

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        pos = sentence.find(item)
        if pos != -1:
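            # Caveat: list.insert() treats an index past the end of the list
            # as an append, so the character position is only a placement hint.
            sentence_tokens.insert(pos, item)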
    output.append(sentence_tokens)

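The only problem will be overlaps in selected_concepts, for example 'databases process' and 'process'. In that case they end up in the reverse of their order in selected_concepts. You can work around this as follows: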

output = []
selected_concepts_multiplier = len(selected_concepts)
for sentence in sentences:
    sentence_tokens = []
    # enumerate() supplies the index k of each concept in selected_concepts
    for k, item in enumerate(selected_concepts):
        pos = sentence.find(item)
        if pos != -1:
            sentence_tokens.insert((selected_concepts_multiplier * pos) + k, item)
    output.append(sentence_tokens)
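
Because list.insert() appends when the index is larger than the current list length, inserting at a raw character position does not by itself keep the list sorted. A variant that does keep it sorted on every insertion can be sketched with the standard-library bisect.insort. This is a swapped-in technique rather than the code above, and the function name concepts_by_position and the sample data are hypothetical.

```python
import bisect


def concepts_by_position(sentence, concepts):
    """Collect (position, concept) pairs in sorted order, then drop the positions."""
    hits = []
    for concept in concepts:
        pos = sentence.find(concept)
        if pos != -1:
            # insort keeps `hits` ordered by position as each pair is added.
            bisect.insort(hits, (pos, concept))
    return [concept for _, concept in hits]


# Hypothetical sample data, shortened from the question for illustration.
sample_sentence = 'knowledge discovery in databases process or kdd'
sample_concepts = ['process', 'kdd', 'databases process', 'knowledge discovery']
print(concepts_by_position(sample_sentence, sample_concepts))
# → ['knowledge discovery', 'databases process', 'process', 'kdd']
```

Overlapping concepts such as 'databases process' and 'process' start at different positions, so they come out in sentence order without any multiplier.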

Answer 2 (score: 1)

There is a built-in operator called "in". It checks whether one string occurs inside another string.

sentences = [
'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
'data mining is the analysis step of the knowledge discovery in databases process or kdd'
]

selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',
    'databases process',
    'information',
    'process'
]

output = []  # prepare the output
for s in sentences:  # check each sentence
    output.append([])  # add an inner list, so output becomes a list of lists
    for c in selected_concepts:  # check every selected concept
        if c in s:  # if the concept occurs in the sentence
            output[-1].append(c)  # add it to the most recent inner list

print(output)

Answer 3 (score: 1)

You can use the fact that regular expressions search text from left to right and do not allow overlapping matches:

import re
concept_re = re.compile(r'\b(?:' +
    '|'.join(re.escape(concept) for concept in selected_concepts) + r')\b')
output = [match
        for sentence in sentences for match in concept_re.findall(sentence)]

output
# => ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems', 'data mining', 'interdisciplinary subfield', 'information', 'information', 'data mining', 'databases process']

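This should also be faster than searching for each concept separately, because regular expressions use a more efficient algorithm and can be implemented entirely in low-level code.

There is one difference, though: if a concept is repeated within a sentence, your code reports it once per sentence, whereas this code outputs every occurrence. If that difference matters, it is easy to deduplicate the list.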
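
To get one list per sentence with repeats removed, the same kind of pattern can be applied sentence by sentence and deduplicated while preserving order. This is a minimal sketch: the function name ordered_unique_concepts and the sample data are hypothetical, and it relies on dict preserving insertion order (Python 3.7 and later).

```python
import re


def ordered_unique_concepts(sentences, concepts):
    """Return, per sentence, the matched concepts in order of first appearance."""
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(c) for c in concepts) + r')\b')
    # dict.fromkeys drops repeated matches but keeps the first-seen order.
    return [list(dict.fromkeys(pattern.findall(s))) for s in sentences]


# Hypothetical sample data, shortened from the question for illustration.
sample_sentences = [
    'extract information from a data set and transform the information',
    'data mining is the process of discovering patterns',
]
sample_concepts = ['patterns', 'data mining', 'information', 'process']
print(ordered_unique_concepts(sample_sentences, sample_concepts))
# → [['information'], ['data mining', 'process', 'patterns']]
```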

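Answer 4 (score: 1)

Here I use a simple re.findall approach: if the pattern matches somewhere in the string, re.findall returns the matches; otherwise it returns an empty list, which is what the code below relies on.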

import re

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

output = []

for sentence in sentences:
    matched_concepts = []
    for selected_concept in selected_concepts:
        if re.findall(selected_concept, sentence):
            matched_concepts.append(selected_concept)
    output.append(matched_concepts)
print(output)

Output:

[['machine learning', 'patterns', 'data mining', 'methods', 'database systems', 'process'], ['data mining', 'interdisciplinary subfield', 'information'], ['data mining', 'databases process', 'process']]
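
The concepts are passed to re.findall as patterns, so a concept containing regex metacharacters such as parentheses or a dot would not be matched literally. A hedged sketch of a literal-matching variant using re.escape follows; the function name concepts_matching and the sample concept 'kdd (process)' are hypothetical and do not come from the question's data.

```python
import re


def concepts_matching(sentence, concepts):
    """Return the concepts that occur literally in `sentence`, in list order."""
    # re.escape makes characters such as '(' or '.' match themselves.
    return [c for c in concepts if re.search(re.escape(c), sentence)]


# Hypothetical sample data chosen to include regex metacharacters.
sample_sentence = 'data mining or kdd (process)'
sample_concepts = ['data mining', 'kdd (process)']
print(concepts_matching(sample_sentence, sample_concepts))
# → ['data mining', 'kdd (process)']
```

Like the code above, this keeps the order of the concept list rather than the order within the sentence.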