I'm new to Python, and I've been trying to get the links and their inner text from this HTML:
<div class="someclass">
<ul class="listing">
<li>
<a href="http://link1.com" title="">title1</a>
</li>
<li>
<a href="http://link2.com" title="">title2</a>
</li>
<li>
<a href="http://link3.com" title="">title3</a>
</li>
<li>
<a href="http://link4.com" title="">title4</a>
</li>
</ul>
</div>
I want every link from the href attributes (http://link.com) along with the inner text (title).
I tried this code:
div = soup.find_all('ul',{'class':'listing'})
for li in div:
    all_li = li.find_all('li')
    for link in all_li.find_all('a'):
        print(link.get('href'))
but had no luck. Can anyone help me?
Answer 0: (score: 3)
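The problem is that find_all returns a list, so in your second for loop you are calling find_all on a list instead of on a single tag. Use find() on each li instead: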
>>> for ul in soup.find_all('ul', class_='listing'):
...     for li in ul.find_all('li'):
...         a = li.find('a')
...         print(a['href'], a.get_text())
...
http://link1.com title1
http://link2.com title2
http://link3.com title3
http://link4.com title4
You can also use a CSS selector instead of the nested for loops:
>>> for a in soup.select('.listing li a'):
...     print(a['href'], a.get_text(strip=True))
...
http://link1.com title1
http://link2.com title2
http://link3.com title3
http://link4.com title4
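Answer 1: (score: 2)
After selecting the ul, get all the a tags inside it, then extract the href from the anchors that have an href attribute and the text from the anchors that have a title attribute.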
from bs4 import BeautifulSoup
html = """<div class="someclass">
<ul class="listing">
<li>
<a href="http://link1.com" title="">title1</a>
</li>
<li>
<a href="http://link2.com" title="">title2</a>
</li>
<li>
<a href="http://link3.com" title="">title3</a>
</li>
<li>
<a href="http://link4.com" title="">title4</a>
</li>
</ul>
</div>"""
soup = BeautifulSoup(html,"lxml")
ul = soup.select("ul.listing")[0]
links = [a["href"] for a in ul.select("a[href]")]
text = [a.text for a in ul.select("a[title]")]
print(text)
print(links)
Which will give you:
['title1', 'title2', 'title3', 'title4']
['http://link1.com', 'http://link2.com', 'http://link3.com', 'http://link4.com']
If you do have multiple uls matching that class:
uls = soup.select("ul.listing")
links = [a["href"] for ul in uls for a in ul.select("a[href]") ]
text = [a.text for ul in uls for a in ul.select("a[title]")]
print(text)
print(links)
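The two lists above are built from two different selectors, so they only line up if every anchor has both an href and a title attribute. A minimal sketch of reading both values from the same tag follows. The shortened HTML, in which the second anchor has no title attribute, and the name pairs are assumptions made for illustration; "html.parser" is used here only because it ships with Python, and "lxml" should behave the same for this input.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened input: the second anchor has no title attribute.
html = """<ul class="listing">
<li><a href="http://link1.com" title="">title1</a></li>
<li><a href="http://link2.com">title2</a></li>
</ul>"""

# "html.parser" is the parser bundled with Python; "lxml" also works if installed.
soup = BeautifulSoup(html, "html.parser")

# Take href and text from the same tag so each pair stays aligned,
# even when some anchors lack a title attribute.
pairs = [
    (a["href"], a.get_text(strip=True))
    for a in soup.select("ul.listing a[href]")
]
print(pairs)
```

With separate a[href] and a[title] selectors, this input would yield two links but only one text entry, which is the mismatch the single pass avoids.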
Answer 2: (score: 1)
In your code, all_li is actually a list of li elements. On the next line, you try to use it as if it were a single element:
all_li.find_all('a')
Instead, you need to iterate over the elements of all_li and call find_all on each one.
Something like this should work:
uls = soup.find_all('ul', {'class': 'listing'})
for ul in uls:
    for li in ul.find_all('li'):
        for link in li.find_all('a'):
            url = link.get('href')
            contents = link.text
            print(url, contents)
With Python 3, this will produce:
http://link1.com title1
http://link2.com title2
http://link3.com title3
http://link4.com title4
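To see the difference between the list returned by find_all and a single element, a small self-contained sketch like the one below may help. The one-item HTML and the names error_seen, first_li and hrefs are assumptions introduced for illustration, and "html.parser" is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item version of the listing from the question.
html = '<ul class="listing"><li><a href="http://link1.com" title="">title1</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a ResultSet, which is a list of Tag objects.
all_li = soup.find_all("li")
print(type(all_li).__name__, isinstance(all_li, list))

# Calling find_all on the list itself fails, as in the question's code.
try:
    all_li.find_all("a")
    error_seen = False
except AttributeError as exc:
    error_seen = True
    print("AttributeError:", exc)

# Calling find_all on one element of the list works.
first_li = all_li[0]
hrefs = [link.get("href") for link in first_li.find_all("a")]
print(hrefs)
```

The failing call is the same pattern as all_li.find_all('a') in the question, which is why looping over the list and calling find_all (or find) on each li fixes it.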