Question

This is related to my earlier question, How to identify substrings in the order of the string?

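Given a list of sentences and a list of selected_concepts, I want to find the selected_concepts that occur in each sentence, in the order in which they appear in that sentence.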

The code below does this correctly.

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)

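However, my real dataset has 13242627 selected_concepts and 1234952 sentences, so I would like to know whether this code can be optimised to run in less time. As far as I understand, it is O(n^2), so I am concerned about time complexity (space complexity is not a problem for me).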

An example is given below.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]

Answer 1

How about using a precompiled regex?

Here is an example:

import re

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',  # spelling error: “knowledge”
    'databases process',
    'information',
    'process']

re_concepts = [re.escape(t) for t in selected_concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

output = [find_all_concepts(sentence) for sentence in sentences]

You get:

[['data mining',
  'process',
  'patterns',
  'methods',
  'machine learning',
  'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'databases process']]
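Note that this output differs from the question's loop in two ways: findall reports a concept every time it occurs (hence 'information' twice), and regex matches never overlap, so 'process' is not reported separately once it has been consumed as part of 'databases process'. If repeats are unwanted, a minimal sketch along these lines might help. The helper names build_finder and concepts_in_order are hypothetical, and sorting the alternatives longest-first is an assumption about the desired behaviour (prefer the longer concept when two start at the same position), not something the question asks for.

```python
import re

def build_finder(concepts):
    # Longest alternatives first, so that at a given position the regex
    # tries 'data mining' before a shorter concept such as 'data'.
    ordered = sorted(set(concepts), key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(c) for c in ordered))
    return pattern.findall

def concepts_in_order(sentence, find_all):
    # dict.fromkeys keeps first-seen order and drops later repeats
    # (dicts preserve insertion order in Python 3.7+).
    return list(dict.fromkeys(find_all(sentence)))

find_all = build_finder(['information', 'data mining', 'data'])
result = concepts_in_order(
    'data mining extracts information and stores information as data',
    find_all)
print(result)
```

The pattern is still compiled once and reused for every sentence; only the per-sentence post-processing changes.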
