在字符串python中查找匹配的短语和单词

时间:2014-09-22 15:21:48

标签: python string nlp

使用python,从给定字符串中提取常用短语或单词的最有效方法是什么?

例如,

string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"

会回来:

["a","time","there","was a very","called Jack"] 

如何有效地进行此操作(在我的情况下,我需要在数千个1000字的文档中执行此操作)?

4 个答案:

答案 0 :(得分:2)

您可以split每个字符串,然后intersect set s。

string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
set(string1.split()).intersection(set(string2.split()))

结果

set(['a', 'very', 'Jack', 'time', 'was', 'called'])

请注意,这仅匹配单个单词。你必须更具体地考虑你会考虑什么"短语"。最长的连续匹配子串?这可能会变得更加复杂。

答案 1 :(得分:1)

在自然语言处理中,您通常使用n-grams从句子中提取常见的模式和序列。 在python中,您可以使用优秀的NLTK模块。

为了计算和查找最常见的内容,您可以使用collections.Counter

以下是2克的示例:

from nltk.util import ngrams
from collections import Counter
from itertools import chain

string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"

n = 2
ngrams1= ngrams(string1.split(" "), n)
ngrams2= ngrams(string2.split(" "), n)

counter= Counter(chain(ngrams1,ngrams2))       #count occurrences of each n-gram
print [k[0] for k,v in counter.items() if v>1] #print all ngrams that come up more than once

输出:

[('called', 'Jack'), ('was', 'a'), ('a', 'very')]

输出n=3

[('was', 'a', 'very')]

输出n=1(没有元组):

['Jack', 'a', 'was', 'time', 'called', 'very']

答案 2 :(得分:1)

这是一个经典的动态编程问题。您需要做的就是为string1构建一个后缀树,用词而不是字母(这是通常的公式)。这是一个illustrative example of a suffix tree

  1. 将树中的所有节点标记为s1
  2. 逐个插入string2的所有后缀。
  3. 步骤2中后缀通过的所有节点都标记为s2
  4. 在步骤2中创建的任何新节点也标记为s2
  5. 在最后一棵树中,标记为s1s2的每个节点的路径标签都是一个公共子字符串。
  6. 此算法在this lecture note中简明扼要地解释。

    对于两个长度为nm的字符串,后缀树构造需要O(max(n,m)),所有匹配的子字符串(在您的情况下,单词或短语)都可以在{ {1}}。

答案 3 :(得分:0)

几年后,但我在下面的“ Counter”中尝试过这种方式:

输入[]:

from collections import Counter

string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
string1 += ' ' + string2
string1 = string1.split()

count = Counter(string1)
tag_count = []
for n, c in count.most_common(10):
    dics = {'tag': n, 'count': c}
    tag_count.append(dics)

输出[]:

[{'tag': 'a', 'count': 4},
 {'tag': 'very', 'count': 3},
 {'tag': 'time', 'count': 2},
 {'tag': 'was', 'count': 2},
 {'tag': 'called', 'count': 2},
 {'tag': 'Jack', 'count': 2},
 {'tag': 'once', 'count': 1},
 {'tag': 'upon', 'count': 1},
 {'tag': 'there', 'count': 1},
 {'tag': 'large', 'count': 1}]

希望这对某人有用:)