Question

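(This is a follow-up to an earlier post, where user https://stackoverflow.com/users/771848/alecxe helped me out. It made more sense to post this follow-up as a standalone question so that it is easier for others to find.)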

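I have a Python script that uses Beautiful Soup to find some web reports on a hosted service.

Right now the script is very rigid, and I would like to make it more flexible. I think a regex is what I need, but perhaps some nested searches would work too. I'm open to suggestions.

My current code looks like this: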

def search_table_for_report(table, report_name, report_type):
    # Search the rows of the table for the given report name,
    # then grab the download URL for the given file type.
    # The [1:] slice skips the first row, i.e. the headers.
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')

        if report_name in col[0].string:
            print("----- parse out file type request url")
            report_type = report_type.upper()
            # this works, using an exact match
            label = row.find("input", {"aria-label": "Select " + report_name + " I format " + report_type})
            # this doesn't work, using a regex
            #label = row.find("input", {"aria-label": re.compile("\b" + report_name + ".*\b" + report_type + ".*")})

            print("----- okay found the right checkbox, now grab the href link ----")
            link_url = label.find_next_sibling("a", href=True)["href"]
            return link_url

This searches a table like the following:

<tr class="odd">
 <td header="c1">
  Report Download
 </td>
 <td header="c2">
  <input aria-label="Select Report I format PDF" id="documentChkBx0" name="documentChkBx" type="checkbox" value="5446"/>
  <a href="/a/document.html?key=5446">
   <img alt="Portable Document Format" src="/img/icons/icon_PDF.gif">
   </img>
  </a>
  <input aria-label="Select Report I format XLS" id="documentChkBx1" name="documentChkBx" type="checkbox" value="5447"/>
  <a href="/a/document.html?key=5447">
   <img alt="Excel Spreadsheet Format" src="/img/icons/icon_XLS.gif">
   </img>
  </a>
 </td>
 <td header="c4">
  04/27/2015
 </td>
 <td header="c5">
  05/26/2015
 </td>
 <td header="c6">
  05/26/2015 10:00AM EDT
 </td>
</tr>

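I want to search the value of "aria-label" for two values, or rather two partial matches within it. Basically, sometimes I might need to find "Select Matrix format PDF" instead of "Select Report format XLS". I'm fairly sure the "Select" and "format" bits will always be there, but I can't be certain, so the search only needs to partially match the second word and the final extension type. The partial match (rather than an exact position) matters because the "Report" word is sometimes followed by extra words I can't predict, such as "Select Report II format XLS", which would fail an exact search for "Select Report format XLS".

So I need code (a regex, presumably) that searches for a given name (in place of Report) and a given type (in place of XLS). Here is what I tried, but it does not work. I think the regex syntax is fine, and that I am putting re.compile in the wrong place, using it in a way Beautiful Soup does not expect.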

label = row.find("input", {"aria-label": re.compile("\b" + report_name + ".*\b" + report_type + ".*")})

Hopefully I have explained this well enough. Happy to clarify any confusion.

Answer 1

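I figured this out. My BS4 search technique was fine; it just needed a smarter regex pattern. The following works great! I'm not sure how to make this search case-insensitive, but that's okay for now.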

# build the pattern to search on,
# where report_name and report_type are strings passed into the function
regex_criteria = r'.*' + report_name + r'.*' + report_type

# search the value of the "aria-label" attribute
# across all the inputs in the current row
target_input = row.find("input", {"aria-label": re.compile(regex_criteria)})
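
A likely reason the original attempt failed is that "\b" inside a normal (non-raw) Python string literal is a backspace character, not a regex word boundary; a raw string (r"\b") keeps the backslash for the regex engine. The sketch below is one possible way to build the pattern with that fixed, and it also covers the case-insensitive search via re.IGNORECASE. The helper name build_label_pattern and the sample labels are made up for illustration, and re.escape is added on the assumption that the report name should be matched literally.

```python
import re


def build_label_pattern(report_name, report_type):
    """Hypothetical helper: compile a pattern that finds report_name,
    then report_type somewhere later in the same string."""
    return re.compile(
        # raw strings so that \b reaches the regex engine as a word boundary
        r"\b" + re.escape(report_name) + r".*\b" + re.escape(report_type),
        re.IGNORECASE,  # "xls" matches "XLS", "report" matches "Report"
    )


# In a normal string literal, "\b" is the backspace character (\x08).
assert "\b" == "\x08"
assert r"\b" == "\\b"

# Sample aria-label values (assumed, modeled on the table in the question).
labels = [
    "Select Report I format PDF",
    "Select Report I format XLS",
    "Select Report II format XLS",
    "Select Matrix format PDF",
]

pattern = build_label_pattern("report", "xls")
matches = [label for label in labels if pattern.search(label)]
print(matches)
```

The compiled pattern can be passed to Beautiful Soup in the same way as above, for example row.find("input", {"aria-label": build_label_pattern(report_name, report_type)}). Beautiful Soup tests a compiled pattern against the attribute value with search(), so a leading '.*' is not needed.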

Trying to use Beautiful Soup (Python) to find 2 partial matches in the value of an attribute
