OK, so I am using bs4 (BeautifulSoup) to parse a website and find the specific headlines I want. My code looks like this:
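import requests
from bs4 import BeautifulSoup

url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(r, 'html.parser')

for i in soup.find_all(class_='article-short'):
    if i.a:
        # Text of the first <a> inside the article block
        print(i.a.text.replace('\n', '').strip())
    else:
        print(i.contents[0].strip())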
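This code works, but before it prints the requested headlines from the site, the output starts with 20 blank lines. What is wrong with my code, or what can I do to get rid of the blank lines?
Answer 0 (score: 0)
Because you have elements like this: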
<article class="article-short">
<div class="thumb"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather"><img alt="FILE: Boys who have undergone a circumcision ceremony walk near Qunu in the Eastern Cape in 2013. Picture: AFP." height="147" src="http://ewn.co.za/cdn/-%2fmedia%2f3C37CB28056746CD95FC913757AAD41C.ashx%3fas%3d1%26h%3d147%26w%3d234%26crop%3d1;waeb9b8157b3e310df" width="234"/></a></div>
<h6 class="h6-mega"><a href="http://ewn.co.za/2016/05/14/Contralesa-against-scrapping-initiation-due-to-cold-weather">Contralesa against scrapping initiation due to cold weather</a></h6>
</article>
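Here the first link contains only an image and no text, so i.a.text is an empty string and print() outputs a blank line for it.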
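To see this in isolation, here is a minimal sketch that parses a trimmed copy of the markup above without touching the network. The `href`, `src`, and `alt` values are placeholders I have substituted for brevity, and `html.parser` is an assumed parser choice; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the article markup; "#" and "placeholder" are stand-in values.
html = """
<article class="article-short">
<div class="thumb"><a href="#"><img alt="placeholder" src="#"/></a></div>
<h6 class="h6-mega"><a href="#">Contralesa against scrapping initiation due to cold weather</a></h6>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find(class_="article-short")

# article.a is the FIRST <a> in the block: the thumbnail link, which wraps only an <img>.
first_link_text = article.a.text.strip()
# article.h6 is the heading that actually carries the headline text.
headline_text = article.h6.text.strip()

print(repr(first_link_text))
print(repr(headline_text))
```

The first value is an empty string, which is what shows up as a blank line in the original output, while the second holds the headline.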
You should look for the h6 tag instead. So something like this works:
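import requests
from bs4 import BeautifulSoup

url = 'http://www.ewn.co.za/Categories/Local'
r = requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')

for i in soup.find_all(class_='article-short'):
    # Prefer the <h6> headline; fall back to the block's first child otherwise.
    title = (i.h6.text.replace('\n', '') if i.h6 else i.contents[0]).strip()
    # Skip entries whose title is empty so no blank lines are printed.
    if title:
        print(title)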
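As a possible variation, the same idea can be sketched with a CSS selector, so that only the h6 headings inside article-short blocks are visited. This is an alternative I am suggesting, not part of the answer above; `extract_titles` is a hypothetical helper name, and the sample markup is invented for the demonstration.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the non-empty headline strings found in article-short blocks.

    Hypothetical helper for illustration; assumes headlines live in <h6> tags.
    """
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    # Descendant selector: every <h6> nested inside an element with class article-short.
    for heading in soup.select(".article-short h6"):
        title = heading.get_text(strip=True)
        if title:  # drop headings that contain only whitespace
            titles.append(title)
    return titles


# Invented sample: one normal article and one whose heading is blank.
sample = """
<article class="article-short">
<div class="thumb"><a href="#"><img alt="placeholder" src="#"/></a></div>
<h6 class="h6-mega"><a href="#">First headline</a></h6>
</article>
<article class="article-short">
<h6 class="h6-mega"><a href="#">  </a></h6>
</article>
"""

titles = extract_titles(sample)
print(titles)
```

To run it against the live page, pass in `requests.get(url).text` instead of `sample`. Unlike the loop above, this version has no fallback for blocks without an h6, so it would silently skip them.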