I want to get the href from this HTML using BeautifulSoup:
<a href="first_url" class="class" href="2nd_url" style="15px;">text</a>
From this, I want to get first_url.
But with BeautifulSoup,
for link in soup.find_all('a', {'class': "class"}):
    print(link.get('href'))
I get 2nd_url as the output.
Answer 0 (score: 2)
The tag has two href= attributes defined, which is invalid. However, running BeautifulSoup's diagnose() function on it shows how each parser handles the markup:
data = '''<a href="first_url" class="class" href="2nd_url" style="15px;">text</a>'''
from bs4 import BeautifulSoup
from bs4.diagnose import diagnose
diagnose(data)
Prints:
Diagnostic running on Beautiful Soup 4.8.1
Python version 3.6.8 (default, Oct 7 2019, 12:59:55)
[GCC 8.3.0]
Found lxml version 4.4.1.0
Found html5lib version 1.0.1
Trying to parse your markup with html.parser
Here's what html.parser did with the markup:
<a class="class" href="2nd_url" style="15px;">
text
</a>
--------------------------------------------------------------------------------
Trying to parse your markup with html5lib
Here's what html5lib did with the markup:
<html>
<head>
</head>
<body>
<a class="class" href="first_url" style="15px;">
text
</a>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml
Here's what lxml did with the markup:
<html>
<body>
<a class="class" href="first_url" style="15px;">
text
</a>
</body>
</html>
--------------------------------------------------------------------------------
Trying to parse your markup with lxml-xml
Here's what lxml-xml did with the markup:
<?xml version="1.0" encoding="utf-8"?>
<a class="class" href="2nd_url" style="15px;">
text
</a>
--------------------------------------------------------------------------------
We see that with the lxml or html5lib parser, href= will be first_url. html.parser (like lxml-xml) gives us 2nd_url instead.
So:
soup = BeautifulSoup(data, 'lxml')
print(soup.a['href'])
Prints:
first_url
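
If installing lxml or html5lib is not an option, one possible alternative is to read the attributes directly with the standard library's html.parser.HTMLParser, which passes attributes to handle_starttag as a list of (name, value) pairs in source order, so the duplicate is still visible. This is only a minimal sketch, not part of the approach above; the class name FirstHrefCollector and its hrefs attribute are made up for illustration, and it does not filter on the class attribute the way the question's find_all call does.

```python
from html.parser import HTMLParser


class FirstHrefCollector(HTMLParser):
    """Hypothetical helper: keep the first href of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs in source order,
        # so the first matching pair is the first href in the markup.
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)
                break


data = '<a href="first_url" class="class" href="2nd_url" style="15px;">text</a>'
collector = FirstHrefCollector()
collector.feed(data)
collector.close()
print(collector.hrefs)
```

For the markup in the question this prints ['first_url'].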