Removing escape sequences from parsed HTML

Time: 2013-09-17 20:16:19

Tags: python html escaping mechanize

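I'm using the Python mechanize module to submit a simple query to a website, then breaking up the returned elements to get the data I need. But I can't seem to handle the escape sequences that come back correctly. Here is my code: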

def stripEscape(string):    #credit goes to sarnold
    delete = ""
    i=1
    while (i<0x20):
        delete += chr(i)
        i += 1
    t = string.translate(None, delete)
    return t

def getHTML(metID):
    br = mechanize.Browser()
    response = br.open("http://urlgoeshere.com")

    br.form = list(br.forms())[0]
    br["PROMPT12"] = metID

    response = br.submit()
    htmlText = response.read()
    parseHTML(htmlText)

def parseHTML(htmlText):
    htmlText.index('table')
    arr = re.split(r'(</?\w{2}>)',htmlText)   # everything after background tag 
    logFile = open('Log.txt','wb')

    for ele in arr:
        ele = stripEscape(ele)
        if ele == '':
            arr.remove(ele)

    for ele in arr:
        logFile.write("ele: "+ele+'\n') 
        if re.match('/table', ele):
            logFile.write("END OF TABLE FOUND")
            logFile.write("\nele: "+ele+'\n')
            break
        # other element filters

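The stripEscape function works fine when I pass it arguments in the interactive shell, but one of the array elements from the website is \r\n</table>\r\n, and it "escapes" my filter. It gets written to my log file like this: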

ele: normal
ele: stuff
ele: 
</table>

ele: more
ele: normal

The closing table tag that slips past the filter throws off all my other filters. Is there a better way to handle escape sequences?

1 answer:

Answer 0: (score: 1)

In the first for loop, the stripped ele values are never saved back into the array.

for ele in arr:
    ele = stripEscape(ele)
    if ele == '':
        arr.remove(ele)

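This part of the code only changes the loop variable ele, NOT arr; arr stays unchanged. As a result, the escape sequences are NOT removed from the elements stored in arr. You can print arr after the loop to check this.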
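The behavior described above can be reproduced with a small sketch. The sample lists here are made up for illustration, and str.strip stands in for stripEscape. The second half shows a related pitfall in the same loop: calling remove on a list while iterating over it can skip elements.

```python
# Hypothetical sample data standing in for the result of re.split
arr = ["normal", "\r\n</table>\r\n", "more"]

for ele in arr:
    # This rebinds the local name `ele` only; the list keeps the old string
    ele = ele.strip()

print(arr)  # the middle element still contains \r\n

# Removing items from a list while looping over it shifts the remaining
# items left, so the element right after a removed one is never visited
items = ["", "", "a"]
for x in items:
    if x == "":
        items.remove(x)

print(items)  # one empty string is left behind
```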

So what you need to do is save the stripped elements into a new array, which the next loop can then use. It could look like this:

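newArray = []
for ele in arr:
    ele = stripEscape(ele)   # strip first, so elements that were only control characters get dropped too
    if ele != "":
        newArray.append(ele)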


for ele in newArray:
    logFile.write("ele: "+ele+'\n') 
    if re.search('/table', ele):   # re.match only matches at the start of the string, so it would miss '</table>'
        logFile.write("END OF TABLE FOUND")
        logFile.write("\nele: "+ele+'\n')
        break
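
For readers on Python 3, where string.translate(None, delete) no longer works, the same idea might be sketched as follows. This is only an illustration under assumptions: strip_escape, clean_elements, CONTROL_CHARS, and the sample HTML string are hypothetical names and data, not part of the original code. It deletes the same character range as stripEscape (0x01 through 0x1F) and builds a new list instead of modifying the one being iterated.

```python
import re

# Map control characters 0x01-0x1F to None so str.translate deletes them
# (same range as stripEscape in the question, written for Python 3)
CONTROL_CHARS = {i: None for i in range(1, 0x20)}

def strip_escape(text):
    """Return text with control characters such as \\r and \\n removed."""
    return text.translate(CONTROL_CHARS)

def clean_elements(html_text):
    """Split on two-letter tags, strip every piece, and drop empty pieces."""
    parts = re.split(r'(</?\w{2}>)', html_text)
    stripped = (strip_escape(p) for p in parts)
    # Build a new list rather than removing from `parts` while iterating
    return [p for p in stripped if p != ""]

# Made-up sample input containing the problematic \r\n</table>\r\n piece
sample = "<tr><td>normal</td></tr>\r\n</table>\r\n<td>more</td>"
cleaned = clean_elements(sample)

for ele in cleaned:
    if re.search('/table', ele):
        print("END OF TABLE FOUND")
        break
```

Stripping before the emptiness check matters here: a piece consisting only of \r\n becomes an empty string after stripping and is then filtered out.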