Question

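I'm currently putting together this function to check multiple URLs. It reads an HTML page into a string and matches the progress percentage of a file transfer, like so: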

def check(server):
    logging.info('Fetching {0}.'.format(server))
    # Open page
    response = urllib2.urlopen("http://"+server+"/avicapture.html")
    tall = response.read() # puts the data into a string
    html = tall.rstrip()
    # Grab progress percentage.
    match = re.search('.*In Progress \((.*)%\).*', html)

Then, on a match, it returns the percentage number from the string to the parent process.

    if match:
        global temp
        global results
        temp = match.group(1)
        results = temp
        servers[server] = temp
        if int(temp) >= 98 and int(temp) <= 99:
            abort(server)
            alertmail(temp, server)
            rem = str(server)
            complete(rem)
            logging.info('{0} completed.'.format(server))
        return str(temp)

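However, sometimes the page doesn't say "In Progress" with a percentage. It says "Transfer Aborted" or "Ready" instead. How can I structure this so that it returns whichever one it finds: In Progress (percentage), Transfer Aborted, or Ready?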

Edit: I forgot to mention that I need it to match the most recent file transfer, based on end time. (See: http://www.whatdoiknow.net/dump/avicapture_full.html#status)

Partial solution:

    match = re.search('.*In Progress \((.*)%\).*', html)
    match2 = re.search('.*Ready.*', html)
    match3 = re.search('.*Transfer Aborted.*', html)
    if match:
        global temp
        temp = match.group(1)
        if int(temp) >= 98 and int(temp) <= 99:
            logging.info('{0} completed.'.format(server))
        return str(temp)
    elif match2:
        temp = "Ready"
        logging.info('{0} is ready.'.format(server))
        return str(temp)
    elif match3:
        temp = "Transfer Aborted"
        logging.info('{0} was Aborted.'.format(server))
        return str(temp)

But this doesn't address my need to identify the most recent transfer.

Answer 1

You just need to use | (alternation) in the regular expression:

match = re.search(r"(In Progress \((.*)%\)|Transfer Aborted|Ready)", html)

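Here match.group(1) will contain the whole match (In Progress (00%), Transfer Aborted, or Ready), while match.group(2) will contain the number 00 (or None in the second and third cases, where the inner group does not participate in the match).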
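As a minimal sketch of how the two groups could be used together, the helper below returns the percentage when there is one and the status text otherwise. The name parse_status and the sample strings are hypothetical, and the pattern uses \d+ instead of .* for the percentage, which is a small tightening of the regex above rather than part of the original suggestion.

```python
import re

# Alternation: group(1) is the whole status, group(2) only the percentage.
# \d+ is used here instead of .* so the group cannot overrun on a long line.
STATUS_RE = re.compile(r"(In Progress \((\d+)%\)|Transfer Aborted|Ready)")


def parse_status(html):
    """Return the percentage as a string, a status label, or None."""
    match = STATUS_RE.search(html)
    if match is None:
        return None
    if match.group(2) is not None:
        # "In Progress (NN%)" branch: the inner group captured the number.
        return match.group(2)
    # "Transfer Aborted" or "Ready": group(2) is None, so use group(1).
    return match.group(1)


# Made-up sample inputs, not taken from the real page.
print(parse_status("<td>In Progress (42%)</td>"))
print(parse_status("<td>Transfer Aborted</td>"))
print(parse_status("<td>Ready</td>"))
```

Note that re.search returns the first occurrence in the string, so on a page with several rows this still does not pick the most recent transfer.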

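Update 1: regarding the need to get the most recent row. The page at http://www.whatdoiknow.net/dump/avicapture.html is fairly simple HTML, so my suggestion is to use an HTML parsing tool (I recommend beautifulsoup4, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse it into a tree, then find the first row in the table that contains N/A, take the row before it, and apply the re to its last column.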

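Update 2: now that I think about it, it may not be necessary to parse the HTML at all. You can use re.findall (or re.finditer) to get a list of tuples of matched strings (or match objects) and take the last item from it.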
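A rough sketch of the re.finditer variant, assuming the last status that appears in the page text belongs to the most recent transfer. The function name last_status and the sample text are hypothetical.

```python
import re

# Same three alternatives; group(1) holds the percentage when present.
STATUS_RE = re.compile(r"In Progress \((\d+)%\)|Transfer Aborted|Ready")


def last_status(html):
    """Return the status of the last matching occurrence, or None."""
    matches = list(STATUS_RE.finditer(html))
    if not matches:
        return None
    last = matches[-1]
    # group(1) is None for "Transfer Aborted" and "Ready".
    if last.group(1) is not None:
        return last.group(1)
    return last.group(0)


# Made-up page text with three rows, oldest first.
sample = "row1 Ready\nrow2 Transfer Aborted\nrow3 In Progress (57%)\n"
print(last_status(sample))
```

re.finditer is used rather than re.findall because match objects keep both the full match and the captured percentage, which avoids juggling tuples of empty strings.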

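Update 3: updates 1 and 2 assume that the table is sorted by date. If it is not, you will need to include a date pattern in the regular expression and take the match with the latest date.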
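If the rows are not sorted, the idea could look something like the sketch below. It rests on assumptions that are not confirmed by the page: each row is taken to contain a timestamp of the form YYYY-MM-DD HH:MM:SS followed by the status on the same line, and rows without such a timestamp are simply ignored. The name latest_status, the timestamp format, and the sample rows are all hypothetical and would need adjusting to the real markup.

```python
import re
from datetime import datetime

# ASSUMED layout: "<timestamp> ... <status>" on a single line.
ROW_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"  # assumed timestamp format
    r".*?"                                     # rest of the row (no newline)
    r"(In Progress \((\d+)%\)|Transfer Aborted|Ready)"
)


def latest_status(html):
    """Return the status attached to the latest timestamp, or None."""
    best = None  # (timestamp, status) of the newest row seen so far
    for m in ROW_RE.finditer(html):
        stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        status = m.group(3) if m.group(3) is not None else m.group(2)
        if best is None or stamp > best[0]:
            best = (stamp, status)
    return best[1] if best is not None else None


# Made-up, deliberately unsorted rows.
sample = (
    "2013-05-01 10:00:00 | Ready\n"
    "2013-05-03 09:30:00 | In Progress (75%)\n"
    "2013-05-02 12:00:00 | Transfer Aborted\n"
)
print(latest_status(sample))
```

Parsing the timestamps with datetime.strptime and comparing the resulting objects avoids relying on string ordering, which only works for some date formats.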

Python regex: matching multiple conditions in the same string
