I have the following code:
html_doc = """
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
all=soup.find_all("td",{"class":"normal alg"})
for item in all:
    a=str(item.find('a').contents[0])
    b=
How can I extract a and b for every result? For the first row I expect:
a= Link1.rar
b= https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2
Extracting either the whole URL or just the part inside it would be fine.
Thanks.
Answer 0 (score: 2)
Try the following code. It selects all anchor tags, then reads each tag's text and href value.
html_doc = """
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
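# Match every <a> whose title attribute starts with "Download".
links = soup.select("a[title^='Download']")
for item in links:
    a = item.text      # link text, e.g. "Link1.rar"
    b = item['href']   # link target
    print(a)
    print(b)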
Or use this:
html_doc = """
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
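# Same as above, but only anchors nested inside a <td class="normal"> cell.
links = soup.select("td.normal a[title^='Download']")
for item in links:
    a = item.text      # link text, e.g. "Link1.rar"
    b = item['href']   # link target
    print(a)
    print(b)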
Output:
Link1.rar
https://example.com/?283zh5uw21s47nefi4n2
Link2.rar
https://example.com/?9hqarjfyw1tpowop9wxc
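Note that the question's expected b is the src of the img tag, while the code above returns the href of the anchor. If the image URL is what is wanted, a minimal sketch along the same lines might look like the following. The helper name extract_pairs and the trimmed-down html_doc are assumptions made for illustration, not part of the original post; it walks each td with both the normal and alg classes and reads the anchor text and the img src from the same cell.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (style/size attributes dropped for brevity).
html_doc = """
<tr>
<td class="normal alg">
<img src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg">
<img src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>
"""


def extract_pairs(html):
    """Hypothetical helper: return (link text, img src) for each matching cell."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # td.normal.alg matches cells that carry both classes.
    for cell in soup.select("td.normal.alg"):
        link = cell.find("a")
        img = cell.find("img")
        if link is None or img is None:
            continue  # skip cells that lack either tag
        pairs.append((link.get_text(strip=True), img.get("src")))
    return pairs


pairs = extract_pairs(html_doc)
for a, b in pairs:
    print("a =", a)
    print("b =", b)
```

Looking both tags up inside the same cell keeps each file name paired with its own image URL, and swapping img.get("src") for link.get("href") gives back the download link used in the answer above.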