Smallest subset in a paragraph

Time: 2012-06-23 10:57:29

Tags: arrays string search

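I have a paragraph and a list of words. The task is to find the shortest subsegment of the paragraph that contains all the given keywords. Matching should be case-insensitive.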

For example,

Text = "Hello I am xyz. I am xyz. I am xyz"

Keywords = {"I", "XYZ"}

Output = "xyz I" (special characters such as "." and "," are excluded) // starts at index 3 (0-based)
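To make the word indexing in the example concrete, here is a minimal sketch. It assumes words are runs of ASCII letters and that the example text reads as in the string below (my English rendering of it); the helper name `tokenize` is hypothetical. Note that a regex split like this treats "xyz.I" as two words, which is not necessarily what the code further down does.

```python
import re

def tokenize(text):
    """Split text into words, keeping only runs of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)

# English rendering of the example text (an assumption about the exact wording).
words = tokenize("Hello I am xyz. I am xyz. I am xyz")
keywords = {k.lower() for k in ("I", "XYZ")}

# Positions of the words that match a keyword, ignoring case.
hits = [i for i, w in enumerate(words) if w.lower() in keywords]

print(words[3:5])  # the two-word window that starts at index 3
```

Counting from 0, the word at index 3 is "xyz" and the word at index 4 is "I", which is where the expected output "xyz I" comes from.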

My approach is:

1. Give each keyword an ID (integer), starting from 1, 2, ..., N (where N is the total number of keywords).
2. Create a list (named MATCH) of size equal to the number of words in the text.
3. Scan each word in the text.
   If the word is a keyword: MATCH[i] = ID of the keyword
   else: MATCH[i] = 0
4. Using this MATCH list, find the shortest sub-list containing all the numbers 1, 2, ..., N.
5. The sub-list found is the answer.
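The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: step 4 is done here with a single pass using two indices (a sliding window), which is a swapped-in technique and not how the code further down searches; the function names `build_match` and `smallest_window` are hypothetical; and repeated keywords are given one shared ID.

```python
from collections import Counter

def build_match(words, keywords):
    """Steps 1-3: map each word to its keyword ID (1..N), or 0 if it is not a keyword."""
    ids = {}
    for kw in keywords:
        # Assumption: a keyword listed twice keeps the ID it got the first time.
        ids.setdefault(kw.lower(), len(ids) + 1)
    return [ids.get(w.lower(), 0) for w in words], len(ids)

def smallest_window(match, n):
    """Step 4: shortest run of `match` that contains every ID 1..n.

    Returns (start, length), or None if no such run exists.
    """
    counts = Counter()   # how often each ID occurs inside the current window
    covered = 0          # number of distinct IDs inside the current window
    best = None
    left = 0
    for right, kid in enumerate(match):
        if kid == 0:
            continue
        counts[kid] += 1
        if counts[kid] == 1:
            covered += 1
        # Shrink from the left for as long as the window still holds every ID.
        while covered == n:
            length = right - left + 1
            if best is None or length < best[1]:
                best = (left, length)
            lid = match[left]
            if lid:
                counts[lid] -= 1
                if counts[lid] == 0:
                    covered -= 1
            left += 1
    return best

words = "Hello I am xyz I am xyz I am xyz".split()
match, n = build_match(words, ["I", "XYZ"])
best = smallest_window(match, n)
if best is not None:
    start, length = best
    print(" ".join(words[start:start + length]))  # step 5: the sub-list found
```

Sharing one ID between repeated keywords is a choice made for this sketch. If every entry of the keyword list gets its own ID but a word is only ever tagged with the first matching ID, the later ID never appears in MATCH and no sub-list can contain all N numbers.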

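This approach produces wrong output in some cases (I don't know which ones). Could someone point out what I am doing wrong here?

Here is my Python code, in case you want to see it: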

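def simplify(s):
    # Remove special characters.
    # Returns [letters-only string, number of characters removed].
    z = ''
    dels = 0

    for i in range(0, len(s)):
        if 65 <= ord(s[i]) <= 90 or 97 <= ord(s[i]) <= 122:
            z += s[i]
        else:
            dels += 1

    l = []
    l.append(z)
    l.append(dels)

    return l

# Read the text and split it into words on commas and spaces.
text = input().split(",")
txt = []
for i in range(0, len(text)):
    txt.extend(text[i].split(" "))

matchln = len(txt)
n = int(input())

word = []
match = []

# Read the n keywords, one per line.
for i in range(0, n):
    v = simplify(input())
    word.append(v[0])

# Clean each word in place and build the MATCH list.
i = 0
while i < matchln:
    v = simplify(txt[i])
    txt.pop(i)
    if len(v[0]) != 0:
        txt.insert(i, v[0])
        flag = 0
        for j in range(0, n):
            a = txt[i]
            b = word[j]

            if a.lower() == b.lower():
                match.append(j + 1)
                flag = 1
                break
        if flag == 0:
            match.append(0)
        i += 1
    else:
        # The token held only special characters and was dropped,
        # so the word list is now one element shorter.
        matchln -= 1

i = 0
shortlen = 2000001  # infinity
shortstart = 0

# From every keyword position, scan right until all n IDs have been seen.
while i < matchln:
    if match[i] > 0:
        pos = [0] * n
        j = i
        start = i
        sums = 0
        while j < matchln and sums != n:
            if match[j] > 0:
                if pos[match[j] - 1] == 0:
                    pos[match[j] - 1] = 1
                    sums += 1
            j += 1

        leng = j - i

        # Not all keywords occur from here on, so no later start can work either.
        if j == matchln and sums != n:
            break

        # A window of exactly n words cannot be beaten.
        if leng == n and sums == n:
            shortlen = leng
            shortstart = start
            break

        if leng < shortlen:
            shortlen = leng
            shortstart = start

    i += 1

if shortlen == 2000001:
    print('NO SUBSEGMENT FOUND')
else:
    v = ''
    i = shortstart
    while i < (shortstart + shortlen - 1):
        v += txt[i] + ' '
        i += 1

    v += txt[i]
    print(v)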

0 answers:

No answers