I'm trying to grab all of the post URLs from this site: http://esencjablog.pl/
I'm new to Python and web scraping. My code works, but it produces a lot of duplicates - what am I doing wrong?
import requests
from bs4 import BeautifulSoup
import csv

startURL = 'http://esencjablog.pl/'

f = csv.writer(open('test.csv', 'a+', newline=''))
f.writerow(['adres'])

def parseLinks(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'lxml')
    for a in soup.findAll('a', {'class': 'qbutton'}):
        href = a.get('href')
        print('Saved', href)
        f.writerow([href])
    newlink = soup.find('li', {'class': 'next next_last'}).find('a').get('href')
    parseLinks(newlink)

parseLinks(startURL)
Answer 0 (score: 3)
Try the approach below. It should no longer produce duplicates. It turns out your .find_all() call also needs to take the post_more class name into account for the script to behave as expected. You can fix that with the selector .post_more a.qbutton:

Not recommended:
import requests
from bs4 import BeautifulSoup

startURL = 'http://esencjablog.pl/'

def parseLinks(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'lxml')
    links = [a.get('href') for a in soup.select('.post_more a.qbutton')]
    for link in links:
        print(link)
    newlink = soup.select_one('li.next a').get('href')
    parseLinks(newlink)  # it will continue on and on and never break

if __name__ == '__main__':
    parseLinks(startURL)
However, a better approach is to check whether each newly fetched page still yields new items, so that the script stops cleanly instead of spinning forever:

Instead:
import requests
from bs4 import BeautifulSoup

page = 58
URL = 'http://esencjablog.pl/page/{}/'

while True:
    page += 1
    res = requests.get(URL.format(page))
    soup = BeautifulSoup(res.text, 'lxml')
    items = soup.select('.post_more a.qbutton')
    if len(items) <= 1:
        break  # when there are no new links it should break
    for a in items:
        print(a.get("href"))
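
Since the original goal was to save the links to a CSV file without duplicates, here is a minimal sketch (added for illustration, not part of the answer above) that combines the same pagination loop with a set for de-duplication; it assumes the .post_more a.qbutton selector from this answer and reuses the test.csv filename from the question:

import csv

import requests
from bs4 import BeautifulSoup

URL = 'http://esencjablog.pl/page/{}/'
seen = set()  # hrefs that have already been written

with open('test.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(['adres'])
    page = 0
    while True:
        page += 1
        res = requests.get(URL.format(page))
        soup = BeautifulSoup(res.text, 'lxml')
        items = soup.select('.post_more a.qbutton')
        if len(items) <= 1:  # same break condition as in the answer above
            break
        for a in items:
            href = a.get('href')
            if href and href not in seen:  # skip anything already saved
                seen.add(href)
                writer.writerow([href])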
Answer 1 (score: 2)
You are also targeting the a elements in the carousel, which appear on every page you visit, so you need to narrow your search down. You can target the elements with the class qbutton small:
for a in soup.findAll('a', {'class': 'qbutton small'}):
Or you can use CSS selectors, as in SIM's answer, to specify the class of the parent element.
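
For illustration, a minimal sketch of both options (the class names qbutton small and post_more are taken from the answers above):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://esencjablog.pl/').text, 'lxml')

# Option 1: match on the button's own classes
by_class = [a.get('href') for a in soup.select('a.qbutton.small')]

# Option 2: scope the match through the parent element's class, as in SIM's answer
by_parent = [a.get('href') for a in soup.select('.post_more a.qbutton')]

print(len(by_class), len(by_parent))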
Answer 2 (score: 0)
Assuming the requirement is to extract all links rendered by the buttons labelled "Czytaj dalej", the following code works.
import requests
from bs4 import BeautifulSoup
import csv

def writerow(row, filename):
    with open(filename, 'a', encoding='utf-8', newline='\n') as toWrite:
        writer = csv.writer(toWrite)
        writer.writerow([row])

def parseLinks(url):
    page = requests.get(url)
    if page.status_code == 200:  # page is fetched
        soup = BeautifulSoup(page.text, 'html.parser')
        # get total number of pages to scrape
        last_page_link = soup.find('li', class_='last').a['href']
        number_of_pages = int(last_page_link.split("/")[-2])
        # get links from number_of_pages
        for pageno in range(0, number_of_pages):
            # generate url with page number
            # format: http://esencjablog.pl/page/2/
            page_url = url + "page/" + str(pageno + 1)
            # fetch the page, parse links and write to csv
            thepage = requests.get(page_url)
            if thepage.status_code == 200:
                soup = BeautifulSoup(thepage.text, "html.parser")
                for a in soup.find_all('a', class_='qbutton small'):
                    print('Saved {}'.format(a['href']))
                    writerow(a['href'], 'test.csv')

if __name__ == "__main__":
    startURL = 'http://esencjablog.pl/'
    parseLinks(startURL)
I think the OP is getting duplicates because he is also scraping the links from the top slider.
I used html.parser instead of lxml because I feel more comfortable with it.
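
As a quick sanity check of the slider explanation, one can compare how many links the question's broad qbutton match finds against the narrower post-only selector suggested in the other answers (a sketch based on the class names mentioned above):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://esencjablog.pl/').text, 'html.parser')

all_qbuttons = soup.find_all('a', class_='qbutton')    # what the question's code matches
post_qbuttons = soup.select('.post_more a.qbutton')    # only the "Czytaj dalej" buttons under posts

# Any extra links in the first count come from the slider/carousel,
# which repeats on every page and therefore shows up as duplicates.
print(len(all_qbuttons), len(post_qbuttons))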