[Tricky] Searching for multiple occurrences of a word pair based on proximity. Python

Time: 2016-01-08 11:45:06

Tags: python regex search

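I have a body of text and 2 keywords, say k1 and k2. I want to find all cases where k1 and k2 occur within 5 words of each other. From this search I want to store 2 pieces of information:

  1. The number of such matches.
  2. The word-wise position of the best match. 'Best' here means the match with the highest proximity between k1 and k2 (the fewest words apart), so that I can work with this match further later on.

I have written some code, shown below, but it fails to find the matches. It also does not give me the number of matches or the word positions.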

    import re
    text = 'the flory of gthys inhibition in this proffession by in aquaporin protein-1  its inhibition by the state of the art in aquaporin 2'
    a = 'aquaporin protein-1'
    b = 'inhibition'
    diff=500
    l = re.split(';|,|-| ', text)
    l1 = re.split(';|,|-| ', a)
    l2 = re.split(';|,|-| ', b)
    counts=[m.start() for m in re.finditer(a, text)]
    counts1=[m.start() for m in re.finditer(b, text)]
    for cc in counts:
        for c1 in counts1:
            if abs(cc-c1) < diff:
                diff = abs(cc-c1)
                values = (cc, c1)
    
    if text.find(a) < text.find(b):
        r= (l.index(l2[0]) - l.index(l1[-1]))
    if text.find(a) > text.find(b):
        r= (l.index(l1[0]) - l.index(l2[-1]))
    if r<5:
        print 'matched'
        print r
    

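3 answers:

Answer 0 (score: 1)

I decided to replace your multi-word keywords in the original text with placeholders, because that way phrases can still be detected: they are no longer broken apart when the string is split on whitespace.

After that, a simple loop over index and value counts the matches and keeps track, in a tuple, of the keyword positions of the match with the smallest distance.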

text = 'the flory of gthys inhibition in this proffession by in aquaporin protein-1  its inhibition b'
a = 'aquaporin protein-1'
b = 'inhibition'
# Replace each keyword with a one-token placeholder so split() keeps phrases intact
text = text.replace(a, 'k1')
text = text.replace(b, 'k2')
l = text.split()
#print(l)
#print('k1 -> %s' % a)
#print('k2 -> %s' % b)

last_a = -1  # index of the most recent k1, -1 if none seen yet
last_b = -1  # index of the most recent k2, -1 if none seen yet
counts = 0
max_match_tuple = (6, 0)  # Sentinel: its distance (6) is larger than any accepted match (at most 5)
for k, v in enumerate(l):
    #print(str(k) + '--->' + str(v))
    if v == 'k1':
        last_a = k
        if k - last_b < 6 and last_b != -1:
            counts = counts + 1
            if k - last_b < max_match_tuple[0] - max_match_tuple[1]:
                max_match_tuple = (k, last_b)
    if v == 'k2':
        last_b = k
        if k - last_a < 6 and last_a != -1:
            counts = counts + 1
            if k - last_a < max_match_tuple[0] - max_match_tuple[1]:
                max_match_tuple = (k, last_a)  # Careful with the order here since it matters for the subtraction above
print(counts)
print(max_match_tuple)

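Some explanation of the replace part, with an example. You can replace whatever you want to detect in the text with a placeholder, so that you can use it later in the loop. So if you want to change the keywords, you only need to change the definitions of the variables a and b.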

text = 'the flory of gthys inhibition in this proffession by in aquaporin protein-1  its inhibition by the state of the art in aquaporin 2'

a = 'aquaporin protein-1'
text = text.replace(a, '******')

print(text)

# Output ---> the flory of gthys inhibition in this proffession by in ******  its inhibition by the state of the art in aquaporin 2

b = 'in'
text = text.replace(b, '+++')

# Note that replace() also hits 'in' inside longer words:
# Output ---> the flory of gthys +++hibition +++ this proffession by +++ ******  its +++hibition by the state of the art +++ aquapor+++ 2
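As the second output above shows, a plain `str.replace` also rewrites the keyword where it is only part of a longer word (`inhibition`, `aquaporin`). If that is not wanted, one possible workaround is a regex substitution that only matches the keyword when it is not glued to other word characters. This is only a sketch; `replace_whole` is a hypothetical helper name, not part of the answer's code:

```python
import re

def replace_whole(text, phrase, placeholder):
    """Replace `phrase` only where it is not embedded in a longer word.

    Hypothetical helper for illustration. re.escape protects characters
    such as '-' that could otherwise be read as regex syntax.
    """
    pattern = r'(?<!\w)' + re.escape(phrase) + r'(?!\w)'
    # A function is used as the replacement so that backslashes in
    # `placeholder` are never interpreted as group references.
    return re.sub(pattern, lambda m: placeholder, text)

text = 'the flory of gthys inhibition in this proffession by in aquaporin protein-1  its inhibition'
text = replace_whole(text, 'aquaporin protein-1', 'k1')
text = replace_whole(text, 'in', 'k2')
print(text)
```

With this variant the standalone `in` tokens become `k2`, while `inhibition` is left untouched.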

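Answer 1 (score: 0)

So I came up with my own code, ......

Give it a try. The advantage is that it gives you a list of tuples (distance between the words, index of keyword 1, index of keyword 2):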

text = 'the flory of gthys inhibition in this proffession by in aquaporin protein-1 its inhibition b , aquaporin protein-1'
a = 'aquaporin protein-1'
b = 'inhibition'
k1 = "_KEYWORD_1_"
k2 = "_KEYWORD_2_"
text = text.replace(a, k1)
text = text.replace(b, k2)
l = text.split()

d_idx = {k1:[], k2:[]}
for k,v in enumerate(l):
    if v == k1:
        d_idx[k1].append(k)
    elif v == k2:
        d_idx[k2].append(k)

distance = 5
data = []
for idx1 in d_idx[k1]:
    for idx2 in d_idx[k2]:
        d = abs(idx1 - idx2)
        if d<=distance:
            data.append((d,idx1,idx2))

Let's sort the data by the distance between the keywords:

data.sort(key=lambda x: x[0])

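So the smallest distance will be in the first element of data (although there may be several elements with the same distance, and data[0] fails if nothing matched):

print("Least distance: %d" % data[0][0])
print("Index of kw1 and kw2: %s" % (data[0][1:],))
print("Number of occurrences: %d" % len(data))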
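Since several pairs can share the smallest distance, and `data` can be empty when nothing lies within range, a slightly more defensive way to read the result might look like the sketch below. The `data` literal is what the loops above should yield for the sample text with `distance = 5`; `best_distance` and `best_matches` are names introduced here for illustration only:

```python
# (distance, index of keyword 1, index of keyword 2), as built above.
# Assumed values for the sample text with distance = 5.
data = [(2, 10, 12), (3, 15, 12)]

if data:
    # Smallest distance, plus every pair that reaches it (handles ties).
    best_distance = min(item[0] for item in data)
    best_matches = [item for item in data if item[0] == best_distance]
else:
    # No keyword pair within range: avoid indexing into an empty list.
    best_distance = None
    best_matches = []

print("Least distance: %s" % best_distance)
print("Best matches: %s" % best_matches)
print("Number of occurrences: %d" % len(data))
```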

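-------------- Edit -----------
So if you want to treat some multi-word terms as a single word (for the purpose of measuring distance), you have to replace them first. This (untested) code might work.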

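input_text = 'the flory of gthys inhibition in this proffession by in aquaporin protein-1 its inhibition b , aquaporin protein-1'

a = 'aquaporin protein-1'
b = 'inhibition'

multiwords = ['aquaporin protein-1']
text = input_text
for mw in multiwords:
    # Join each multi-word term with '__' so that split() keeps it as one token.
    # Accumulate into text so that every term in the list is replaced, not just the last one.
    mw_no_space = mw.replace(' ', '__')
    text = text.replace(mw, mw_no_space)
k1 = a.replace(' ', '__')
k2 = b.replace(' ', '__')

l = text.split()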

d_idx = {k1:[], k2:[]}
for k,v in enumerate(l):
    if v == k1:
        d_idx[k1].append(k)
    elif v == k2:
        d_idx[k2].append(k)

distance = 10
data = []
for idx1 in d_idx[k1]:
    for idx2 in d_idx[k2]:
        d = abs(idx1 - idx2)
        if d<=distance:
            data.append((d,idx1,idx2))

data.sort(key=lambda x: x[0])
print(data)

print("Least distance: %d" % data[0][0])
print("Index of kw1 and kw2: %s" % (data[0][1:],))
print("Number of occurrences: %d" % len(data))
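The steps above (glue multi-word keywords together, split, collect indices, pair them up) could also be packed into one function, which makes it easier to try different keywords and distances. This is a rough sketch, not tested beyond the sample text; the function name `proximity_matches` and its parameters are made up for this example:

```python
def proximity_matches(text, a, b, max_distance=5):
    """Return sorted (distance, index_of_a, index_of_b) tuples.

    Hypothetical wrapper around the approach above. It assumes neither
    keyword occurs inside the other and that '__' does not already
    appear in the text.
    """
    # Glue multi-word keywords together so split() keeps them as one token.
    k1 = a.replace(' ', '__')
    k2 = b.replace(' ', '__')
    tokens = text.replace(a, k1).replace(b, k2).split()

    # Word positions of each keyword.
    idx1 = [i for i, tok in enumerate(tokens) if tok == k1]
    idx2 = [i for i, tok in enumerate(tokens) if tok == k2]

    # Every pair within range, closest first.
    return sorted((abs(i - j), i, j)
                  for i in idx1 for j in idx2
                  if abs(i - j) <= max_distance)

sample = ('the flory of gthys inhibition in this proffession by in '
          'aquaporin protein-1 its inhibition b , aquaporin protein-1')
matches = proximity_matches(sample, 'aquaporin protein-1', 'inhibition', 10)
print(matches)
print("Number of occurrences: %d" % len(matches))
```

The nested loop compares every occurrence of one keyword with every occurrence of the other, which is fine for short texts but grows quadratically with the number of occurrences.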

Answer 2 (score: 0)

In theory you could do this with a regular expression, but supporting all the edge cases would be very cumbersome.

A simple form is: https://regex101.com/r/zW1dD3/2

(?P<K1>key1)\s+(?P<BETWEEN>(\w+\s+(?!key2)){0,4}\w+\s+)?(?P<K2>key2)

Sample data:

word0 key1 key2 word1 word0 key1 word1 word2 key2 word3 word0 key1 word1 word2 word3 key2 word4 word0 key1 word1 word2 word3 word4 key2 word5 word0 key1 word1 word2 word3 word4 word5 key2 word6 word0 key1 word1 word2 word3 word4 word5 word6 key2 word7
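To try the pattern from Python, something along these lines should work. It is only a sketch: `key1` and `key2` are the literal placeholders from the pattern, and the pattern as written only finds `key1` followed by `key2`, not the reverse order. The variable names `gaps` and `between` are introduced here for illustration:

```python
import re

pattern = re.compile(
    r'(?P<K1>key1)\s+(?P<BETWEEN>(\w+\s+(?!key2)){0,4}\w+\s+)?(?P<K2>key2)')

sample = ('word0 key1 key2 word1 '
          'word0 key1 word1 word2 key2 word3 '
          'word0 key1 word1 word2 word3 key2 word4 '
          'word0 key1 word1 word2 word3 word4 key2 word5 '
          'word0 key1 word1 word2 word3 word4 word5 key2 word6 '
          'word0 key1 word1 word2 word3 word4 word5 word6 key2 word7')

gaps = []
for m in pattern.finditer(sample):
    # BETWEEN is None when key2 directly follows key1.
    between = m.group('BETWEEN') or ''
    gaps.append(len(between.split()))
    print(m.start('K1'), m.start('K2'), gaps[-1])

print("Number of matches: %d" % len(gaps))
print("Fewest words in between: %d" % min(gaps))
```

On the sample data this should report five matches, with 0, 2, 3, 4 and 5 words between the keys; the last case, with six words in between, is out of range. The positions printed here are character offsets, so a word-wise position would still have to be derived from them.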