Regular expression to extract links containing numbers in Selenium

Date: 2017-10-09 11:24:31

Tags: python regex selenium beautifulsoup

Here is a list of links.

<a class="table-link" href="/tasks/document/new">Should review
</a></td>
<a class="table-link" href="/tasks/document/58324">Should review
</a></td>
<td>
<a class="table-link" href="/tasks/document/58325">AFCO certificate
</a></td>
<td>
<a class="table-link" href="/tasks/document/58325">Document Task
</a></td>
<td>
<a class="table-link" href="/tasks/document/58326">Pending
</a></td>
<td>
<a class="table-link" href="/tasks/document/58327">Cami  ltd
</a></td>
<td>
<a class="table-link" href="/tasks/document/58328">29 Sep 14:57

I want to extract all links that contain /tasks/document and end with a number. The output should look like this:

<a class="table-link" href="/tasks/document/58324">
<a class="table-link" href="/tasks/document/58325">
<a class="table-link" href="/tasks/document/58326">
<a class="table-link" href="/tasks/document/58327">
<a class="table-link" href="/tasks/document/58328">

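I am currently using the following code: driver.find_elements_by_css_selector("a[href*='/tasks/document/']")

How can I modify it so that it only picks up the links that end in a number?

2 answers:

Answer 0 (score: 1)

This can be done with BeautifulSoup, as follows: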

html = """    
<a class="table-link" href="/tasks/document/new">Should review</a></td>
<a class="table-link" href="/tasks/document/58324">Should review</a></td>
<td>
<a class="table-link" href="/tasks/document/58325">AFCO certificate</a></td>
<td>
<a class="table-link" href="/tasks/document/58325">Document Task</a></td>
<td>
<a class="table-link" href="/tasks/document/58326">Pending</a></td>
<td>
<a class="table-link" href="/tasks/document/58327">Cami  ltd</a></td>
<td>
<a class="table-link" href="/tasks/document/58328">29 Sep 14:57"""

from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, "html.parser")

# Keep only the <a> tags whose href ends in /tasks/document/<digits>
for a in soup.find_all('a', href=re.compile(r'/tasks/document/\d+$')):
    print(a)

This displays:

<a class="table-link" href="/tasks/document/58324">Should review</a>
<a class="table-link" href="/tasks/document/58325">AFCO certificate</a>
<a class="table-link" href="/tasks/document/58325">Document Task</a>
<a class="table-link" href="/tasks/document/58326">Pending</a>
<a class="table-link" href="/tasks/document/58327">Cami  ltd</a>
<a class="table-link" href="/tasks/document/58328">29 Sep 14:57</a>

If you only need the actual href, use:

print(a['href'])

Which gives you:

/tasks/document/58324
/tasks/document/58325
/tasks/document/58325
/tasks/document/58326
/tasks/document/58327
/tasks/document/58328
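
Note that /tasks/document/58325 appears twice in the listing above, while the desired output in the question lists each link once. If duplicates are unwanted, the hrefs could be filtered and de-duplicated in one pass. This is only a minimal sketch: the function name `unique_numeric_hrefs` is hypothetical, and it assumes the hrefs are relative paths like the ones printed above.

```python
import re

# The whole href must be the task prefix followed only by digits.
NUMERIC_TASK_HREF = re.compile(r"/tasks/document/\d+")


def unique_numeric_hrefs(hrefs):
    """Return numeric task hrefs in first-seen order, without duplicates."""
    matching = (h for h in hrefs if NUMERIC_TASK_HREF.fullmatch(h))
    # dict keys keep insertion order (Python 3.7+), so order is preserved.
    return list(dict.fromkeys(matching))


hrefs = [
    "/tasks/document/new",
    "/tasks/document/58324",
    "/tasks/document/58325",
    "/tasks/document/58325",
    "/tasks/document/58326",
]
print(unique_numeric_hrefs(hrefs))
```

The list passed in could be built from the loop above, for example with `[a['href'] for a in soup.find_all('a')]`.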

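Answer 1 (score: 0)

Selenium has no selector that does this.

If needed, you can use Selenium to fetch the page source and feed it to the BeautifulSoup parser. You can then use a regexp to find the elements you want.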
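
If the Selenium element objects themselves are needed (for example, to click them), one possible alternative is to keep the broad CSS selector from the question and apply the regexp in Python afterwards. The sketch below assumes each element exposes `get_attribute("href")`, as Selenium's WebElement does; the helper name `keep_numeric_links` is hypothetical. Selenium typically returns the resolved absolute URL for `href`, so the pattern is anchored only at the end.

```python
import re

# Matches URLs or paths that end in /tasks/document/<digits>.
NUMERIC_TASK_URL = re.compile(r"/tasks/document/\d+$")


def keep_numeric_links(elements):
    """Keep the elements whose href ends with /tasks/document/<digits>."""
    kept = []
    for element in elements:
        # get_attribute may return None when the attribute is missing.
        href = element.get_attribute("href") or ""
        if NUMERIC_TASK_URL.search(href):
            kept.append(element)
    return kept


# Usage with a live driver (not executed here):
#   candidates = driver.find_elements_by_css_selector("a[href*='/tasks/document/']")
#   links = keep_numeric_links(candidates)
```

This avoids parsing the page a second time, at the cost of one `get_attribute` round trip per candidate element.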