Question

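I wrote an HTML text cleaner that removes the data between tags. It works fine for a single iteration, but not in a loop.

The problem is that, because Python strings are immutable, I can't save newhtml back into a variable, so my loop only applies the last iteration, which is what the function returns.

What is the best practice in this case?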

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

def replace_string(index1, index2, mainstring):
    replacementstring = ''
    return mainstring.replace(mainstring[index1:index2], replacementstring)

def strip_images(html):
    begin_indexes = list(find_all(html, '<DESCRIPTION>GRAPHIC'))
    end_indexes = list(find_all(html, '</TEXT>'))
    for i in range(len(begin_indexes)):
        if begin_indexes[i] > end_indexes[i]:
            end_indexes.pop(0)
        else:
            if len(begin_indexes) == len(end_indexes):
                break

    for i in range(len(begin_indexes)):
        #code problem is here--
        newhtml = replace_string(begin_indexes[i],end_indexes[i], html)
        if i == len(begin_indexes) - 1:
            return newhtml
            #code only returns one iteration

var = strip_images(html)
print(var)

Answer 1

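Your immediate problem is that html never changes inside the loop. No matter how long the lists are, every iteration receives the original string as input, exactly as the first iteration did.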
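To make the difference concrete, here is a minimal sketch on a toy string (the names text, stale and current are made up for this illustration and are not from the question). str.replace returns a new string and leaves its input untouched, so the result has to be fed back in on the next pass:

```python
text = "aXbXc"

# Pattern from the question: every pass starts again from the original string.
for _ in range(2):
    stale = text.replace("X", "", 1)

# Carrying the result forward: every pass starts from the previous result.
current = text
for _ in range(2):
    current = current.replace("X", "", 1)

print(stale)    # abXc
print(current)  # abc
print(text)     # aXbXc (the original is never modified)
```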

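The solution follows these steps:

1. Before the loop, assign the original value to the string.
2. Inside the loop, pass in the current contents and keep the string that comes back from the replacement.
3. After the loop, return the result from the function.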

newhtml = html 
for begin, end in zip(begin_indexes, end_indexes):
    newhtml = replace_string(begin, end, newhtml)
return newhtml
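One thing to watch for with this pattern: indexes computed on the original html stop lining up as soon as text has been removed from newhtml. A rough sketch that avoids this by searching the current string again on every pass is shown here. The function name strip_between and the short markers in the usage line are assumptions for illustration only. Like the question's code, it removes everything from the start marker up to, but not including, the end marker.

```python
def strip_between(text, start_marker, end_marker):
    """Remove each span from start_marker up to (not including) end_marker."""
    result = text
    while True:
        begin = result.find(start_marker)
        if begin == -1:
            break  # no start marker left
        end = result.find(end_marker, begin)
        if end == -1:
            break  # unmatched start marker: leave the rest untouched
        # Build a new string and carry it into the next pass.
        result = result[:begin] + result[end:]
    return result

print(strip_between("a<S>x</E>b<S>y</E>c", "<S>", "</E>"))  # a</E>b</E>c
```

Because each pass removes the start marker it found, the loop always makes progress and ends once no start marker remains.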

Answer 2

Got it working; here is the code snippet. It isn't pretty, but it does remove the text between those two tags:

def find_all(a_str, sub):
    # Yield the start index of every non-overlapping occurrence of sub.
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

def strip_images(html):
    begin_indexes = list(find_all(html, '<DESCRIPTION>GRAPHIC'))
    end_indexes = list(find_all(html, '</TEXT>'))
    for i in range(len(begin_indexes)):
        if begin_indexes[i] > end_indexes[i]:
            end_indexes.pop(0)
        else:
            if len(begin_indexes) == len(end_indexes):
                break

    # Work from the last match to the first so that deleting a slice
    # does not shift the indexes that are still to be processed.
    newhtml = list(html)
    for begin, end in zip(begin_indexes[::-1], end_indexes[::-1]):
        del newhtml[begin:end]
    return ''.join(newhtml)
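The same back-to-front idea can be written with plain slicing, without converting the string to a list first. This is only a sketch: remove_spans is a hypothetical helper, and it assumes the (begin, end) pairs are indexes into the original text and do not overlap.

```python
def remove_spans(text, spans):
    """Remove each text[begin:end] slice; spans index into the original text."""
    # Highest start index first, so the remaining (earlier) spans stay valid.
    for begin, end in sorted(spans, reverse=True):
        text = text[:begin] + text[end:]
    return text

print(remove_spans("0123456789", [(1, 3), (6, 8)]))  # 034589
```

Each pass rebinds text to a new string, which is how string immutability is usually handled in Python: nothing is edited in place, the name simply points at the latest result.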

Repeatedly replacing part of a string in a loop
