Here is a list of links.
<a class="table-link" href="/tasks/document/new">Should review
</a></td>
<a class="table-link" href="/tasks/document/58324">Should review
</a></td>
<td>
<a class="table-link" href="/tasks/document/58325">AFCO certificate
</a></td>
<td>
<a class="table-link" href="/tasks/document/58325">Document Task
</a></td>
<td>
<a class="table-link" href="/tasks/document/58326">Pending
</a></td>
<td>
<a class="table-link" href="/tasks/document/58327">Cami ltd
</a></td>
<td>
<a class="table-link" href="/tasks/document/58328">29 Sep 14:57
I want to extract all links that contain /tasks/document and end with a number. The output should look like this:
<a class="table-link" href="/tasks/document/58324">
<a class="table-link" href="/tasks/document/58325">
<a class="table-link" href="/tasks/document/58326">
<a class="table-link" href="/tasks/document/58327">
<a class="table-link" href="/tasks/document/58328">
I am using the following code: driver.find_elements_by_css_selector("a[href*='/tasks/document/']")
How can I modify it so that it picks up only the links ending in a number?
Answer 0 (score: 1)
This can be done with BeautifulSoup, as follows:
html = """
<a class="table-link" href="/tasks/document/new">Should review</a></td>
<a class="table-link" href="/tasks/document/58324">Should review</a></td>
<td>
<a class="table-link" href="/tasks/document/58325">AFCO certificate</a></td>
<td>
<a class="table-link" href="/tasks/document/58325">Document Task</a></td>
<td>
<a class="table-link" href="/tasks/document/58326">Pending</a></td>
<td>
<a class="table-link" href="/tasks/document/58327">Cami ltd</a></td>
<td>
<a class="table-link" href="/tasks/document/58328">29 Sep 14:57"""
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
# Keep only anchors whose href ends in /tasks/document/<digits>
for a in soup.find_all('a', href=re.compile(r'/tasks/document/\d+$')):
    print(a)
This displays:
<a class="table-link" href="/tasks/document/58324">Should review</a>
<a class="table-link" href="/tasks/document/58325">AFCO certificate</a>
<a class="table-link" href="/tasks/document/58325">Document Task</a>
<a class="table-link" href="/tasks/document/58326">Pending</a>
<a class="table-link" href="/tasks/document/58327">Cami ltd</a>
<a class="table-link" href="/tasks/document/58328">29 Sep 14:57</a>
If you only need the actual href, print this inside the loop instead:
    print(a['href'])
Giving you:
/tasks/document/58324
/tasks/document/58325
/tasks/document/58325
/tasks/document/58326
/tasks/document/58327
/tasks/document/58328
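Note that /tasks/document/58325 appears twice in this output, whereas the output requested in the question lists each link once. If duplicates are unwanted, one possible approach is to filter and de-duplicate the href strings while preserving their order. The following is only a sketch: the names TASK_LINK and unique_numeric_hrefs are made up for illustration, and the input list stands in for whatever hrefs were collected from the page.

```python
import re

# Matches paths such as /tasks/document/58324 but not /tasks/document/new.
TASK_LINK = re.compile(r"/tasks/document/\d+$")


def unique_numeric_hrefs(hrefs):
    """Return hrefs ending in a numeric document id, first occurrence only."""
    matching = (h for h in hrefs if TASK_LINK.search(h))
    # dict.fromkeys drops repeats while keeping insertion order.
    return list(dict.fromkeys(matching))


# Sample hrefs taken from the HTML in the question.
hrefs = [
    "/tasks/document/new",
    "/tasks/document/58324",
    "/tasks/document/58325",
    "/tasks/document/58325",
    "/tasks/document/58326",
]
print(unique_numeric_hrefs(hrefs))
```

With BeautifulSoup, the input list could be built as [a['href'] for a in soup.find_all('a', href=True)].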
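Answer 1 (score: 0)
Selenium has no selector for this: CSS attribute selectors cannot match a regular expression.
If needed, you can use Selenium to get the page source and feed it to the BeautifulSoup parser. You can then use a regular expression to find the desired elements.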
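If you would rather keep working with Selenium elements, another option is to keep the broad CSS selector from the question and filter its results in Python. This is a rough sketch, not a built-in Selenium feature: NUMERIC_TASK and numeric_task_links are hypothetical names, and the pattern uses search rather than match on the assumption that get_attribute("href") may return an absolute URL.

```python
import re

# Anchored at the end only, so both relative paths and absolute URLs match.
NUMERIC_TASK = re.compile(r"/tasks/document/\d+$")


def numeric_task_links(elements):
    """Keep only elements whose href ends in /tasks/document/<digits>."""
    kept = []
    for element in elements:
        # get_attribute can return None when the attribute is missing.
        href = element.get_attribute("href") or ""
        if NUMERIC_TASK.search(href):
            kept.append(element)
    return kept


# Hypothetical usage with an existing Selenium `driver`:
# candidates = driver.find_elements_by_css_selector("a[href*='/tasks/document/']")
# for link in numeric_task_links(candidates):
#     print(link.get_attribute("href"))
```

The function only relies on each element having a get_attribute method, so it can be tried out with simple stand-in objects before running it against a real browser session.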