Question

I want to extract the href from this HTML using BeautifulSoup:

<a href="first_url" class="class" href="2nd_url" style="15px;">text</a>

From this, I want to get first_url.

But with BeautifulSoup,

for link in soup.find_all('a', {'class': 'class'}):
    print(link.get('href'))

I get 2nd_url as the output.

Answer 1

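The tag defines the href= attribute twice, which is invalid HTML, so the result depends on which parser you use. Running BeautifulSoup's diagnose() function on the markup shows how each parser handles it: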

data = '''<a href="first_url" class="class" href="2nd_url" style="15px;">text</a>'''

from bs4 import BeautifulSoup
from bs4.diagnose import diagnose

diagnose(data)

Prints:

Diagnostic running on Beautiful Soup 4.8.1
Python version 3.6.8 (default, Oct  7 2019, 12:59:55) 
[GCC 8.3.0]
Found lxml version 4.4.1.0
Found html5lib version 1.0.1

Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<a class="class" href="2nd_url" style="15px;">
 text
</a>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
 <head>
 </head>
 <body>
  <a class="class" href="first_url" style="15px;">
   text
  </a>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
 <body>
  <a class="class" href="first_url" style="15px;">
   text
  </a>
 </body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<a class="class" href="2nd_url" style="15px;">
 text
</a>
--------------------------------------------------------------------------------

We can see that with the lxml or html5lib parser, href= ends up as first_url, while html.parser (and lxml-xml) gives us 2nd_url.

So:

soup = BeautifulSoup(data, 'lxml')
print(soup.a['href'])

Prints:

first_url
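
If installing lxml or html5lib is not an option, a possible alternative is the on_duplicate_attribute argument, which Beautiful Soup added in version 4.9.1 (newer than the 4.8.1 shown in the diagnostic output above, so this assumes an upgraded install). The following is a minimal sketch, reusing the same markup, of how html.parser can be told to keep the first value of a repeated attribute. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

data = '<a href="first_url" class="class" href="2nd_url" style="15px;">text</a>'

# Default html.parser behaviour: a later duplicate replaces the earlier value.
default_soup = BeautifulSoup(data, 'html.parser')
last_href = default_soup.a['href']

# on_duplicate_attribute='ignore' keeps the first value and drops later ones.
# Assumption: Beautiful Soup 4.9.1 or newer is installed.
first_soup = BeautifulSoup(data, 'html.parser', on_duplicate_attribute='ignore')
first_href = first_soup.a['href']

print(last_href)
print(first_href)
```

With a recent enough Beautiful Soup, the first print should show 2nd_url and the second first_url, which matches what the lxml parser returns above.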
