How to stop articles from printing twice with BeautifulSoup

Asked: 2019-05-11 06:48:11

Tags: python web-scraping beautifulsoup python-3.7

I am trying to print every article link on this site, but each article link prints twice and only 5 of them are printed.

I tried expanding the range to (1, 20), which prints all ten article links, but each one is printed twice.

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = urlopen("https://www.politico.com/newsletters/playbook/archive")
target = 'C:/Users/k/Politico/pol.csv'

content = url.read()

soup = BeautifulSoup(content,"lxml")

for article in range (1,10):
    #Prints each article's link and saves to csv file
    print(soup('article')[article]('a',{'target':'_top'}))

I expect the output to be 10 article links with no duplicates.

3 Answers:

Answer 0: (Score: 1)

You can use the CSS selector .front-list h3 > a:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.politico.com/newsletters/playbook/archive#')
soup = bs(r.content, 'lxml')
# Collect the href of every anchor matched by the selector
links = [link['href'] for link in soup.select('.front-list h3 > a')]
print(links)
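
As a quick sanity check against the requirement in the question (10 links, no duplicates), one could compare the list length with the size of a set built from it; this snippet is an illustrative addition, not part of the original answer:

# Both numbers should be 10 if every link is unique
print(len(links), len(set(links)))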

Answer 1: (Score: 0)

Try printing your soup and you will see that in each iteration there are 2 identical links. That is why each one is printed twice.

Use a set and add all of the str(data) values to it:

# Continuing from the soup defined in the question
a = set()
for article in range(1, 20):
    # str() makes the result hashable, so the set drops the repeated entries
    a.add(str(soup('article')[article]('a', {'target': '_top'})))

print(a)
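
A variation on the same idea (my own sketch, not part of the original answer) is to collect the href values themselves and skip duplicates while preserving page order:

# Continuing from the soup defined in the question
ordered_links = []
for article in soup('article'):
    for a in article('a', {'target': '_top'}):
        href = a.get('href')
        if href and href not in ordered_links:  # keep only the first occurrence
            ordered_links.append(href)

print(ordered_links)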

Answer 2: (Score: 0)

You can use the approach below; it works like a charm.

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = urlopen("https://www.politico.com/newsletters/playbook/archive")
target = 'C:/Users/k/Politico/pol.csv'
content = url.read()
soup = BeautifulSoup(content,"lxml")

articles = soup.findAll('article', attrs={'class': 'story-frag format-l'})

for article in articles:
    # find() returns only the first matching anchor, so each article yields one link
    link = article.find('a', attrs={'target': '_top'}).get('href')
    print(link)

[Screenshot of the expected output.]
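
The target path from the question is defined in this answer's code but never used; if the goal is also to save the links, a minimal sketch (my assumption, not part of the original answer) could write one href per row with the csv module:

import csv

# Write each article link to the CSV path defined above
with open(target, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    for article in articles:
        link = article.find('a', attrs={'target': '_top'}).get('href')
        writer.writerow([link])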