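I have a paragraph and a list of words. The task is to find the shortest sub-segment of the paragraph that contains all of the given keywords. Matching should be case-insensitive.
For example:
Text = "Hello I am xyz. I am xyz. I am xyz"
Keywords = {"I", "XYZ"}
Output = "xyz I" (special characters such as "." and "," are not included) // starts at word index 3 (counting from 0)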
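To make the example concrete, here is a minimal Python 3 sketch of how the text could be split into words and compared case-insensitively. The helper name `tokenize` is hypothetical, and treating a word as a run of ASCII letters is an assumption for illustration, not something the task statement specifies.

```python
import re

def tokenize(text):
    """Return the words of text; a word is assumed to be a run of ASCII letters."""
    return re.findall(r"[A-Za-z]+", text)

words = tokenize("Hello I am xyz. I am xyz. I am xyz")
keywords = {"i", "xyz"}  # lower-cased once so comparisons ignore case

# 0-based word indices at which any keyword occurs.
hits = [(i, w) for i, w in enumerate(words) if w.lower() in keywords]

print(words[3:5])  # ['xyz', 'I']
```

With this tokenization the two-word window starting at index 3 is the expected answer from the example.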
My approach is:
1. Give each keyword an ID (an integer) from 1 to N, where N is the total number of keywords.
2. Create a list named MATCH whose size equals the number of words in the text.
3. Scan each word i in the text:
   if the word is a keyword: MATCH[i] = ID of that keyword
   else: MATCH[i] = 0
4. Using MATCH, find the shortest sub-list that contains every number from 1 to N.
5. The sub-list found is the answer.
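The steps above can be sketched in Python 3 roughly as follows. This is one possible reading of the approach, not the code from this post: step 4 is done here with a sliding window (two pointers), and the names `shortest_segment`, `ids`, `match`, and `count` are hypothetical. It assumes that words are runs of ASCII letters and that keywords are plain words without punctuation.

```python
import re

def shortest_segment(text, keywords):
    """Return (start_index, words) for the shortest window holding every keyword, or None."""
    words = re.findall(r"[A-Za-z]+", text)  # assumed tokenization
    # Step 1: IDs 1..N. Lower-casing plus a dict also collapses duplicate keywords.
    ids = {}
    for k in keywords:
        ids.setdefault(k.lower(), len(ids) + 1)
    n = len(ids)
    if n == 0:
        return None
    # Steps 2-3: MATCH list, with 0 for non-keywords.
    match = [ids.get(w.lower(), 0) for w in words]
    # Step 4: sliding window; count[k] = occurrences of ID k inside [left, right].
    count = [0] * (n + 1)
    have = 0      # number of distinct IDs currently inside the window
    best = None   # (left, right) of the shortest complete window so far
    left = 0
    for right, m in enumerate(match):
        if m:
            count[m] += 1
            if count[m] == 1:
                have += 1
        # Shrink from the left while the window still covers every ID.
        while have == n:
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            lm = match[left]
            if lm:
                count[lm] -= 1
                if count[lm] == 0:
                    have -= 1
            left += 1
    # Step 5: report the window, if any.
    if best is None:
        return None
    return best[0], words[best[0]:best[1] + 1]

print(shortest_segment("Hello I am xyz. I am xyz. I am xyz", ["I", "XYZ"]))
# (3, ['xyz', 'I'])
```

Each word enters and leaves the window at most once, so the scan is linear in the number of words. When several windows tie for the shortest length, this sketch keeps the leftmost one.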
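This approach gives wrong output in some cases (I don't know which ones). I would appreciate it if someone could point out what I am doing wrong.
Here is my Python code, in case you want to see it (the indentation may be a little off, sorry):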
def simplify(s):
    # Remove special characters: keep only ASCII letters.
    # Returns [cleaned string, number of characters removed].
    z=''
    dels=0
    for i in range(0,len(s)):
        if(65<=ord(s[i])<=90 or 97<=ord(s[i])<=122):
            z+=s[i]
        else:
            dels+=1
    l=[]
    l.append(z)
    l.append(dels)
    return l
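# Read the text and split it into words on commas and spaces.
text=raw_input().split(",")
txt=[]
for i in range(0,len(text)):
    txt.extend(text[i].split(" "))
matchln=len(txt)
# Read the n keywords; the keyword at index j gets ID j+1.
n=int(raw_input())
word=[]
match=[]
for i in range(0,n):
    v=simplify(raw_input())
    word.append(v[0])
# Clean each word in place and build match[]: keyword ID, or 0 for a non-keyword.
i=0
while(i<matchln):
    v=simplify(txt[i])
    txt.pop(i)
    if(len(v[0])!=0):
        txt.insert(i,v[0])
        flag=0
        for j in range(0,n):
            a=txt[i]
            b=word[j]
            if(a.lower()==b.lower()):
                match.append(j+1)
                flag=1
                break
        if(flag==0):
            match.append(0)
        i+=1
    else:
        # The token contained no letters and was dropped, so one fewer word remains.
        matchln-=1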
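# Brute-force search: from each keyword position i, extend right until all n IDs are seen.
i=0
shortlen=2000001 # sentinel standing in for infinity
shortstart=0
while(i<matchln):
    if(match[i]>0):
        pos=[0]*n # pos[k]==1 once keyword ID k+1 has been seen in this window
        j=i
        start=i
        sums=0 # number of distinct keywords seen so far
        while(j<matchln and sums!=n):
            if(match[j]>0):
                if(pos[match[j]-1]==0):
                    pos[match[j]-1]=1
                    sums+=1
            j+=1
        leng=j-i
        if(j==matchln and sums!=n):
            break # no later start can cover all keywords either
        if(leng==n and sums==n):
            # A window of exactly n words cannot be beaten, so stop searching.
            shortlen=leng
            shortstart=start
            break
        if(leng<shortlen):
            shortlen=leng
            shortstart=start
    i+=1
if(shortlen==2000001):
    print 'NO SUBSEGMENT FOUND'
else:
    # Join the words of the best window with single spaces.
    v=''
    i=shortstart
    while(i<(shortstart+shortlen-1)):
        v+=(txt[i]+' ')
        i+=1
    v+=txt[i]
    print v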