I'm trying to grab all of the post URLs from this site: http://esencjablog.pl/
I'm new to Python and web scraping. My code works, but it produces a lot of duplicates - what am I doing wrong?
import requests
from bs4 import BeautifulSoup
import csv

startURL = 'http://esencjablog.pl/'

f = csv.writer(open('test.csv', 'a+', newline=''))
f.writerow(['adres'])

def parseLinks(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'lxml')
    for a in soup.findAll('a', {'class': 'qbutton'}):
        href = a.get('href')
        print('Saved', href)
        f.writerow([href])
    newlink = soup.find('li', {'class': 'next next_last'}).find('a').get('href')
    parseLinks(newlink)

parseLinks(startURL)
Answer 0 (score: 3)
Try the approach below. It should no longer produce duplicates. It turns out your .find_all() call also needs to take the post_more class name into account for the script to behave as expected. You can fix that with the selector .post_more a.qbutton:

Not recommended:
import requests
from bs4 import BeautifulSoup

startURL = 'http://esencjablog.pl/'

def parseLinks(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'lxml')
    links = [a.get('href') for a in soup.select('.post_more a.qbutton')]
    for link in links:
        print(link)
    newlink = soup.select_one('li.next a').get('href')
    parseLinks(newlink)  # it will continue on and on and never break

if __name__ == '__main__':
    parseLinks(startURL)
However, a better approach is to check whether each newly fetched page still yields new items, so that the script stops cleanly instead of spinning forever:

Instead:
import requests
from bs4 import BeautifulSoup

page = 58
URL = 'http://esencjablog.pl/page/{}/'

while True:
    page += 1
    res = requests.get(URL.format(page))
    soup = BeautifulSoup(res.text, 'lxml')
    items = soup.select('.post_more a.qbutton')
    if len(items) <= 1:
        break  # when there are no new links it should break
    for a in items:
        print(a.get("href"))
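
Since the original goal was to save the links to a CSV file without duplicates, here is a minimal sketch (added for illustration, not part of the answer above) that combines the same pagination loop with a set for de-duplication; it assumes the .post_more a.qbutton selector from this answer and reuses the test.csv filename from the question:

import csv

import requests
from bs4 import BeautifulSoup

URL = 'http://esencjablog.pl/page/{}/'
seen = set()  # hrefs that have already been written

with open('test.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(['adres'])
    page = 0
    while True:
        page += 1
        res = requests.get(URL.format(page))
        soup = BeautifulSoup(res.text, 'lxml')
        items = soup.select('.post_more a.qbutton')
        if len(items) <= 1:  # same break condition as in the answer above
            break
        for a in items:
            href = a.get('href')
            if href and href not in seen:  # skip anything already saved
                seen.add(href)
                writer.writerow([href])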
Answer 1 (score: 2)
You are also targeting the a elements in the carousel, which appear on every page you visit, so you need to narrow your search down. You can target the elements with the class qbutton small:
for a in soup.findAll('a', {'class': 'qbutton small'}):
Or you can use CSS selectors, as in SIM's answer, to specify the class of the parent element.
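
For illustration, a minimal sketch of both options (the class names qbutton small and post_more are taken from the answers above):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://esencjablog.pl/').text, 'lxml')

# Option 1: match on the button's own classes
by_class = [a.get('href') for a in soup.select('a.qbutton.small')]

# Option 2: scope the match through the parent element's class, as in SIM's answer
by_parent = [a.get('href') for a in soup.select('.post_more a.qbutton')]

print(len(by_class), len(by_parent))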
Answer 2 (score: 0)
Assuming the requirement is to extract all links rendered by the buttons labelled "Czytaj dalej", the following code works.
import requests
from bs4 import BeautifulSoup
import csv

def writerow(row, filename):
    with open(filename, 'a', encoding='utf-8', newline='\n') as toWrite:
        writer = csv.writer(toWrite)
        writer.writerow([row])

def parseLinks(url):
    page = requests.get(url)
    if page.status_code == 200:  # page is fetched
        soup = BeautifulSoup(page.text, 'html.parser')
        # get total number of pages to scrape
        last_page_link = soup.find('li', class_='last').a['href']
        number_of_pages = int(last_page_link.split("/")[-2])
        # get links from number_of_pages
        for pageno in range(0, number_of_pages):
            # generate url with page number
            # format: http://esencjablog.pl/page/2/
            page_url = url + "page/" + str(pageno + 1)
            # fetch the page, parse links and write to csv
            thepage = requests.get(page_url)
            if thepage.status_code == 200:
                soup = BeautifulSoup(thepage.text, "html.parser")
                for a in soup.find_all('a', class_='qbutton small'):
                    print('Saved {}'.format(a['href']))
                    writerow(a['href'], 'test.csv')

if __name__ == "__main__":
    startURL = 'http://esencjablog.pl/'
    parseLinks(startURL)
I think the OP is getting duplicates because he is also scraping the links from the top slider.
I used html.parser instead of lxml because I feel more comfortable with it.
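
As a quick sanity check of the slider explanation, one can compare how many links the question's broad qbutton match finds against the narrower post-only selector suggested in the other answers (a sketch based on the class names mentioned above):

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://esencjablog.pl/').text, 'html.parser')

all_qbuttons = soup.find_all('a', class_='qbutton')    # what the question's code matches
post_qbuttons = soup.select('.post_more a.qbutton')    # only the "Czytaj dalej" buttons under posts

# Any extra links in the first count come from the slider/carousel,
# which repeats on every page and therefore shows up as duplicates.
print(len(all_qbuttons), len(post_qbuttons))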