Extracting the URL and title using BeautifulSoup

Time: 2019-10-02 11:26:46

Tags: python beautifulsoup

I have the following code:

html_doc = """

<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>



"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

all=soup.find_all("td",{"class":"normal alg"})

for item in all:
    a=str(item.find('a').contents[0])
    b = None  # placeholder: this is the value I don't know how to extract


How can I extract a and b for all results?

a= Link1.rar
b= https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2

Either extracting everything between the URLs or just everything in the URL would be fine.

Thanks

1 Answer:

Answer 0: (score: 2)

Try the following code. It selects all anchor tags and then reads the text and href of each one.

html_doc = """

<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>

"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

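# Select every <a> whose title attribute starts with "Download"
links = soup.select("a[title^='Download']")

for item in links:
    a = item.text      # link text, e.g. Link1.rar
    b = item['href']   # URL the link points to
    print(a)
    print(b)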
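
The `^=` in the selector is the CSS "starts with" attribute match, so anchors whose title does not begin with "Download" are skipped. A minimal sketch of that behaviour, using a shortened sample that is hypothetical (the second anchor and the names `sample_html`, `sample_soup` and `pairs` are not part of the original question):

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample: one "Download" link and one unrelated link
sample_html = """
<td class="normal alg">
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
<a href="https://example.com/other" title="Other page">Other</a>
</td>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only anchors whose title starts with "Download",
# and collect (text, href) pairs instead of printing them one by one
pairs = [
    (link.get_text(strip=True), link.get("href"))
    for link in sample_soup.select("a[title^='Download']")
]

print(pairs)
```

Using `link.get("href")` rather than `link["href"]` returns `None` for an anchor without an `href` instead of raising `KeyError`, which may be preferable on less regular pages.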

Or use this:

html_doc = """

<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
<tr>
<td class="normal alg" style="padding-left:10px;overflow:hidden;vertical-align:middle">
<img height="57" src="https://example.com/qr.pl?do=0.9hqarjfyw1tpowop9wxc" style="padding:5px;vertical-align:middle" width="57"/>
<a href="https://example.com/?9hqarjfyw1tpowop9wxc" title="Download Link2.rar">Link2.rar</a>
</td>
<td class="normal">Size 297.56 MB</td>
</tr>

"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

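# Same idea, restricted to "Download" anchors inside <td class="normal ..."> cells
links = soup.select("td.normal a[title^='Download']")

for item in links:
    a = item.text      # link text, e.g. Link1.rar
    b = item['href']   # URL the link points to
    print(a)
    print(b)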

Output:

Link1.rar
https://example.com/?283zh5uw21s47nefi4n2
Link2.rar
https://example.com/?9hqarjfyw1tpowop9wxc
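
Note that the expected `b` shown in the question is the `src` of the `<img>` tag, while the code above prints the `href` of the link. If the image URL is what is actually wanted, a minimal sketch along these lines might work. It assumes each `td` with classes `normal alg` holds one `<img>` and one `<a>`; the trimmed sample and the names `sample_html`, `sample_soup` and `results` are illustrative only:

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the first row of the HTML in the question
sample_html = """
<table>
<tr>
<td class="normal alg">
<img src="https://example.com/qr.pl?do=0.283zh5uw21s47nefi4n2" width="57"/>
<a href="https://example.com/?283zh5uw21s47nefi4n2" title="Download link1.rar">Link1.rar</a>
</td>
<td class="normal">Size 1.62 MB</td>
</tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

results = []
# td.normal.alg matches only cells carrying both classes,
# so the "Size ..." cell is skipped
for cell in sample_soup.select("td.normal.alg"):
    link = cell.find("a")
    img = cell.find("img")
    if link is None or img is None:
        continue  # skip cells that lack either tag
    results.append((link.get_text(strip=True), img["src"]))

for a, b in results:
    print("a =", a)
    print("b =", b)
```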