I have a list of sentences, as follows.
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
'data mining is the analysis step of the knowledge discovery in databases process or kdd']
I also have a list of concepts, grouped alphabetically, as shown below.
concepts = ['data mining', 'database systems', 'databases process',
'interdisciplinary subfield', 'information', 'knowledge discovery',
'methods', 'machine learning', 'patterns', 'process']
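I want to identify the concepts that occur in each of the sentences, in the order in which they appear in that sentence.
So, for the example above, the output should be: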
output = [['data mining','process','patterns','methods','machine learning','database systems'],
['data mining','interdisciplinary subfield','information'],
['data mining','knowledge discovery','databases process']]
I am using the following code to do this.
output = []
counting = 0
for sentence in sentences:
    sentence_tokens = []
    for item in concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    # Order the matched concepts by where they first appear in the sentence.
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    counting = counting + 1
    print(counting)  # progress counter
    output.append(sentence_tokens)
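However, this is really slow: by my estimate, it would take about half a month to process my dataset.
My concept list has about 13,242,627 entries (i.e. len(concepts)), and I have about 350,000 sentences (i.e. len(sentences)).
So I am wondering whether I can use the alphabetical order to search only part of my concept list. Alternatively, would it reduce the running time if I searched for each concept in the sentences, i.e. with for concept in concepts as the outer loop and for sentence in sentences as the inner loop?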
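Answer 0 (score: 3)
At first I considered implementing some string-searching algorithm, but then I realized that the regular expression module (re) probably already contains a good one.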
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']
concepts = ['data mining', 'database systems', 'databases process', 'interdisciplinary subfield', 'information', 'knowledge discovery','methods', 'machine learning', 'patterns', 'process']
import re
re_group = "(" + "|".join(map(re.escape, concepts)) + ")"
output = [re.findall(re_group, sentence) for sentence in sentences]
print(output)
(Thanks to @warvariuc for suggesting re.escape and the code-golfing with map.)
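A few details of the regex approach can matter on real data, so here is a minimal sketch of one possible variant. The choices in it are assumptions on my part, not part of the answer above: the pattern is compiled once, concepts are sorted longest first so that a longer concept is preferred when two start at the same position, \b word boundaries keep a concept from matching inside a larger word, and dict.fromkeys drops repeated matches while keeping first-occurrence order. The sample data is the question's, with 'knowledge discovery' spelled so that it matches the sentence.

```python
import re

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd',
]
concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

# Longest concepts first: alternation tries alternatives in order, so this
# prefers the longer concept when one concept is a prefix of another.
ordered = sorted(concepts, key=len, reverse=True)

# Compile once and reuse the pattern for every sentence.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")

# dict.fromkeys removes repeated concepts while preserving match order.
output = [list(dict.fromkeys(pattern.findall(s))) for s in sentences]
print(output)
```

Note that findall only reports non-overlapping matches, so a concept contained in an already matched longer concept (for example 'process' inside 'databases process') is not reported separately. Whether one alternation over millions of concepts is fast enough is something I have not measured.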
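Answer 1 (score: 2)
There is a data structure called a "trie" or "prefix tree" that you may find useful (https://en.wikipedia.org/wiki/Trie). The solution would iterate over the words in your sentence, matching each position against the longest prefix found in the trie and jumping to the next word if there is no prefix match. In the worst case a lookup is O(m), where m is the length of the string to match. This means you will find all of the concepts at a cost that is, at worst, proportional to the length of the sentence. By comparison, the cost of your algorithm is proportional to the length of your concept list, which is a bit scary.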
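The idea described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the trie is a plain nested dict keyed by whole words (not characters), sentences are tokenized with str.split(), and the names END, build_trie and find_concepts are hypothetical helpers, not from any library.

```python
END = None  # sentinel key marking "a complete concept ends here"; never a real word


def build_trie(concepts):
    """Build a nested-dict trie in which each level is keyed by one word."""
    root = {}
    for concept in concepts:
        node = root
        for word in concept.split():
            node = node.setdefault(word, {})
        node[END] = concept
    return root


def find_concepts(sentence, trie):
    """Return the concepts found in the sentence, in order of appearance."""
    words = sentence.split()
    found = []
    i = 0
    while i < len(words):
        node = trie
        match, match_end = None, i
        j = i
        # Walk down the trie for as long as the next word continues a concept.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                # Remember the longest complete concept seen so far.
                match, match_end = node[END], j
        if match is None:
            i += 1          # no concept starts here; move to the next word
        else:
            found.append(match)
            i = match_end   # continue after the matched concept
    return found


concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']
trie = build_trie(concepts)

first = find_concepts('data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', trie)
third = find_concepts('data mining is the analysis step of the knowledge discovery in databases process or kdd', trie)
print(first)
print(third)
```

The trie is built once over all concepts. After that, the work per sentence depends on the number of words in the sentence and the length of the longest concept, not on how many concepts there are. Unlike the code in the question, this sketch reports a concept every time it occurs, so duplicates would need to be removed separately if they are unwanted.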