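I am trying to build a web scraper that fetches the following data: title, image src, description, and location. All of these work except the location, which sits inside an <em> tag.
This link shows the code I am using: https://pastebin.com/BFZyyhxB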
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://www.manchestereveningnews.co.uk/news/greater-manchester-news').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
title = soup.title
image = soup.image
strong = soup.strong
description = soup.description
location = soup.location
title = soup.find('h1', class_='publication-font', )
image = soup.find('img')
strong = soup.find('strong')
location = soup.find('a', 'href', 'em') #This is either done incorrectly or needs more added
description = soup.find('div', class_='description')
print(title.text)
print(image)
print(strong.text)
print(description.string)
print(location)
This shows the HTML structure I want to scrape, including the <em> tag.
Code: https://pastebin.com/zHy7H220
<div class="teaser"><figure data-mod="image" data-init="true"><div class="spacer" style="padding-top:66.50%;"></div>
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">
<img srcset="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s180/Mike-Grimshaw.jpg 180w, http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s390/Mike-Grimshaw.jpg 390w, http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s458/Mike-Grimshaw.jpg 458w" src="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s615/Mike-Grimshaw.jpg">
</a>
</figure>
<div class="inner">
<em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em> <!-- the text within the <em> tag is what I am trying to get -->
<strong>
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">Mum who witnessed fiancé Michael Grimshaw being fatally stabbed 'cannot face returning home'</a></strong><div class="description">
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">A fundraising campaign has been set up to help Mr Grimshaw's family in the wake of his tragic death</a>
</div>
</div>
</div>
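As you can see, the location lookup returns nothing, which means my code is incorrect. However, I have not been able to work out how to fix it, despite searching for tutorials countless times.
Any help is much appreciated.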
Answer 0 (score: 2)
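OK, the <em> tag wraps the anchor tag. If you want the href link inside that anchor, I believe you need:
location = soup.find('em').find('a')['href']
If it is the text you want, use:
location = soup.find('em').find('a').string # or .text
soup.find takes a tag name, plus an optional dict argument of attributes to match (for example class or id). The syntax you used, soup.find('a', 'href', 'em'), is incorrect: it does not mean "the <a> with an href inside an <em>".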
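To make the point about soup.find arguments concrete, here is a minimal sketch run against a trimmed copy of the teaser markup from the question. The shortened headline and summary strings, the "#" links, and the choice of the built-in html.parser (instead of lxml) are assumptions made so the snippet runs on its own. The last line shows my understanding of why the original call found nothing: Beautiful Soup treats a plain string in the second position as a CSS class to match.

```python
from bs4 import BeautifulSoup

# Trimmed, partly made-up copy of the teaser markup from the question.
html = """
<div class="teaser">
  <div class="inner">
    <em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em>
    <strong><a href="#">Headline</a></strong>
    <div class="description"><a href="#">Summary text</a></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The second positional argument is an attribute filter given as a dict...
description = soup.find("div", {"class": "description"})
# ...which matches the same element as the class_ keyword form.
same = soup.find("div", class_="description")

# Chained find calls: the first <em>, then the first <a> inside it.
link = soup.find("em").find("a")
location_href = link["href"]
location_text = link.get_text(strip=True)

# A bare string in the attrs position is read as a class name, so this
# looks for <a class="href"> and finds nothing in this markup.
broken = soup.find("a", "href")

print(location_text, location_href, broken)
```

If the page might contain a teaser without an <em>, the chained form raises AttributeError on the missing tag, so checking the result of the first find for None before chaining is a reasonable precaution.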
Answer 1 (score: 2)
You can use a CSS selector to do this.
soup.select_one("div em > a").get_text(strip=True)
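The same selector idea can be extended to collect all four fields for every teaser on the page. The following is only a sketch under stated assumptions: the sample markup is a shortened stand-in for the real page, the helper text_of and the list rows are hypothetical names introduced for illustration, and html.parser is used so no extra parser is required. It scopes each select_one call to one teaser so that a teaser without an <em> yields None instead of borrowing a location from a neighbouring teaser.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the listing page; the second teaser has no <em>.
html = """
<div class="teaser">
  <figure><a href="/a"><img src="/img/a.jpg"></a></figure>
  <div class="inner">
    <em><a href="/all-about/sale">Sale</a></em>
    <strong><a href="/a">First headline</a></strong>
    <div class="description"><a href="/a">First summary</a></div>
  </div>
</div>
<div class="teaser">
  <div class="inner">
    <strong><a href="/b">Second headline</a></strong>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_of(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


rows = []
for teaser in soup.select("div.teaser"):
    img = teaser.select_one("img")
    rows.append({
        "location": text_of(teaser, "em > a"),
        "title": text_of(teaser, "strong > a"),
        "description": text_of(teaser, "div.description > a"),
        "image": img.get("src") if img else None,
    })

for row in rows:
    print(row)
```

Searching within each teaser rather than the whole soup keeps the fields of one story together, which is usually what a listing scraper needs.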