How can I get only one href with BeautifulSoup when the same tag contains two hrefs?

Asked: 2019-12-07 14:11:34

Tags: python beautifulsoup

I want to extract the href from this HTML using BeautifulSoup:

<a href="first_url" class="class" href="2nd_url" style="15px;">text</a>

From this, I want to get first_url.

But with BeautifulSoup,

for link in soup.find_all('a', {'class': "class"}):
    print(link.get('href'))

I get 2nd_url as the output.

1 answer:

Answer 0 (score: 2)

The tag defines the href= attribute twice, which is invalid HTML. Still, running BeautifulSoup's diagnose() function on it shows how each parser handles the markup:

data = '''<a href="first_url" class="class" href="2nd_url" style="15px;">text</a>'''

from bs4 import BeautifulSoup
from bs4.diagnose import diagnose

diagnose(data)

Prints:

Diagnostic running on Beautiful Soup 4.8.1
Python version 3.6.8 (default, Oct  7 2019, 12:59:55) 
[GCC 8.3.0]
Found lxml version 4.4.1.0
Found html5lib version 1.0.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<a class="class" href="2nd_url" style="15px;">
 text
</a>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
 </head>
 <body>
  <a class="class" href="first_url" style="15px;">
   text
  </a>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <body>
  <a class="class" href="first_url" style="15px;">
   text
  </a>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<a class="class" href="2nd_url" style="15px;">
 text
</a>
--------------------------------------------------------------------------------

We can see that the lxml and html5lib parsers keep first_url as the href, while html.parser (and lxml-xml) keep 2nd_url.

So:
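The same comparison can be scripted rather than read off the diagnose() output. This is a minimal sketch: the helper name href_by_parser is hypothetical (not part of bs4), and any parser that is not installed is simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

data = '<a href="first_url" class="class" href="2nd_url" style="15px;">text</a>'

def href_by_parser(markup, parsers=("html.parser", "html5lib", "lxml")):
    """Return {parser name: href of the first <a>} for each installed parser.

    Hypothetical helper for illustration only.
    """
    results = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The parser library is not installed; skip it.
            continue
        results[name] = soup.a["href"]
    return results

hrefs = href_by_parser(data)
for name, href in hrefs.items():
    print(name, "->", href)
```

Which value each parser reports may depend on the installed versions, so it is worth running this against your own environment.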

soup = BeautifulSoup(data, 'lxml')
print(soup.a['href'])

Prints:

first_url
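If installing lxml or html5lib is not an option, newer Beautiful Soup releases offer another route. The sketch below assumes Beautiful Soup 4.9.1 or later (newer than the 4.8.1 shown in the diagnose() output), where html.parser accepts an on_duplicate_attribute argument. It reuses the loop from the question.

```python
from bs4 import BeautifulSoup

data = '<a href="first_url" class="class" href="2nd_url" style="15px;">text</a>'

# on_duplicate_attribute="ignore" keeps the first value of a repeated attribute.
# Assumption: Beautiful Soup >= 4.9.1; older releases do not support this argument.
soup = BeautifulSoup(data, "html.parser", on_duplicate_attribute="ignore")

first_hrefs = [link.get("href") for link in soup.find_all("a", {"class": "class"})]
print(first_hrefs)
```

This keeps the standard-library parser and changes only how repeated attributes are resolved.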