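I wrote an HTML text cleaner that removes the data between two tags. It works for a single pass, but not in a loop.
The problem is that, because Python strings are immutable, I can't work out how to keep newhtml as a variable across iterations, so the function only returns the result of the last iteration.
What is the best practice in this situation?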
def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

def replace_string(index1, index2, mainstring):
    replacementstring = ''
    return mainstring.replace(mainstring[index1:index2], replacementstring)
def strip_images(html):
    begin_indexes = list(find_all(html, '<DESCRIPTION>GRAPHIC'))
    end_indexes = list(find_all(html, '</TEXT>'))
    for i in range(len(begin_indexes)):
        if begin_indexes[i] > end_indexes[i]:
            end_indexes.pop(0)
        else:
            if len(begin_indexes) == len(end_indexes):
                break
    for i in range(len(begin_indexes)):
        # code problem is here--
        newhtml = replace_string(begin_indexes[i],end_indexes[i], html)
        if i == len(begin_indexes) - 1:
            return newhtml
# code only returns one iteration
var = strip_images(html)
print var
Answer 0 (score: 0)
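Your immediate problem is that html never changes inside the loop. Every iteration therefore starts again from the original string, no matter how long the lists are, and only the last replacement survives.
The fix is to feed each result back into the next iteration: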
newhtml = html
for begin, end in zip(begin_indexes, end_indexes):
    newhtml = replace_string(begin, end, newhtml)
return newhtml
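One caveat with carrying newhtml through the loop: the offsets were computed on the original html, so after the first removal the later offsets no longer line up, and replace_string relies on str.replace, which removes every occurrence of the sliced text rather than only the one at that position. A minimal sketch of one way around both issues, assuming the begin and end offsets are already paired, sorted, and non-overlapping, is to cut by position and work from the last pair to the first. The name remove_spans and the sample string are hypothetical, not part of the original code.

```python
def remove_spans(text, begin_indexes, end_indexes):
    """Delete text[begin:end] for each (begin, end) pair.

    Hypothetical helper. Assumes the pairs are sorted, non-overlapping,
    and refer to offsets in the original `text`.
    """
    result = text
    # Work right to left so the earlier offsets stay valid after each cut.
    for begin, end in reversed(list(zip(begin_indexes, end_indexes))):
        result = result[:begin] + result[end:]
    return result


# Illustrative input: remove "<X>one" and "<X>two", keeping each "</X>".
sample = "aa<X>one</X>bb<X>two</X>cc"
cleaned = remove_spans(sample, [2, 14], [8, 20])
print(cleaned)
```

Each slice builds a new string, which is how immutable strings are normally "modified" in Python: rebinding the name to the new value on every iteration is enough.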
Answer 1 (score: 0)
I got it working; here is the code. It isn't pretty, but it removes the text between those two tags:
def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

def strip_images(html):
    begin_indexes = list(find_all(html, '<DESCRIPTION>GRAPHIC'))
    end_indexes = list(find_all(html, '</TEXT>'))
    # Intended to discard '</TEXT>' offsets that precede their opening marker.
    for i in range(len(begin_indexes)):
        if begin_indexes[i] > end_indexes[i]:
            end_indexes.pop(0)
        else:
            if len(begin_indexes) == len(end_indexes):
                break
    # Strings are immutable, so edit a list of characters instead.
    newhtml = list(html)
    # Delete from the last span to the first so earlier offsets stay valid.
    for begin, end in zip(begin_indexes[::-1], end_indexes[::-1]):
        del newhtml[begin:end]
    return ''.join(newhtml)
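For comparison, the same kind of removal can be sketched in a single pass with the standard re module, which avoids the index bookkeeping altogether. This is an alternative technique, not what the answer above does. It assumes each '<DESCRIPTION>GRAPHIC' marker is followed by its own '</TEXT>' and that the closing tag itself should be kept, as in the slicing version. The names GRAPHIC_BLOCK and strip_images_re and the sample document are hypothetical.

```python
import re

# Non-greedy match from the marker up to, but not including, the next </TEXT>.
# re.DOTALL lets '.' match newlines, since the block may span several lines.
GRAPHIC_BLOCK = re.compile(r'<DESCRIPTION>GRAPHIC.*?(?=</TEXT>)', re.DOTALL)


def strip_images_re(html):
    """Return html with every GRAPHIC block removed (hypothetical helper)."""
    return GRAPHIC_BLOCK.sub('', html)


doc = "<TEXT>keep</TEXT><DESCRIPTION>GRAPHIC\nbinary</TEXT><TEXT>also keep</TEXT>"
print(strip_images_re(doc))
```

A marker with no '</TEXT>' after it is left untouched, because the lookahead cannot succeed.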