How to scrape multiple pages with Python BeautifulSoup

Time: 2017-06-02 13:13:21

Tags: python beautifulsoup

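I'm new to programming and I'm having trouble scraping all the pages with Python BeautifulSoup. I figured out how to scrape the first page, but I'm lost on how to do all of the pages.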

Here is the code:
#!/usr/bin/python
# -*- encoding: utf-8 -*-
from urllib2 import urlopen
import json
from BeautifulSoup import BeautifulSoup

defaultPage = 1
items = []
url = "https://www.nepremicnine.net/oglasi-prodaja/ljubljana-mesto/stanovanje/%d/"

def getWebsiteContent(page=defaultPage):
    return urlopen(url % (page)).read()

def writeToFile(content):
    file = open("nepremicnine1.json", "w+")
    json.dump(content, file)
    # file.write(content)
    file.close()

def main():

    content = getWebsiteContent(page=defaultPage)
    soup = BeautifulSoup(content)
    posesti = soup.findAll("div", {"itemprop": "itemListElement"})

    for stanovanja in posesti:
        item = {}
        item["Naslov"] = stanovanja.find("span", attrs={"class": "title"}).string
        item["Velikost"] = stanovanja.find("span", attrs={"class": "velikost"}).string
        item["Cena"] = stanovanja.find("span", attrs={"class": "cena"}).string
        item["Slika"] = stanovanja.find("img", src = True)["src"]

        items.append(item)

        writeToFile(items)

main()

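So I'd like a loop in which the %d in the URL increases by 1 on each pass, since the pages are numbered 2, 3, and so on.

Any help is much appreciated.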

1 Answer:

Answer 0: (Score: 1)

You are not incrementing the defaultPage variable.

The approach you're taking is correct. You just need to increment the defaultPage variable each time you finish scraping a page.

def main():
    global defaultPage  # main() reassigns the module-level page counter

    while defaultPage <= numPages:  # Loop through all pages. You also need to define the value of numPages.
        content = getWebsiteContent(page=defaultPage)
        soup = BeautifulSoup(content)
        posesti = soup.findAll("div", {"itemprop": "itemListElement"})

        for stanovanja in posesti:
            item = {}
            item["Naslov"] = stanovanja.find("span", attrs={"class": "title"}).string
            item["Velikost"] = stanovanja.find("span", attrs={"class": "velikost"}).string
            item["Cena"] = stanovanja.find("span", attrs={"class": "cena"}).string
            item["Slika"] = stanovanja.find("img", src = True)["src"]

            items.append(item)

            writeToFile(items)

        defaultPage += 1  # Move on to the next page

I think this should work.
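
If you don't know numPages in advance, one possible alternative is to keep requesting pages until one comes back with no listings. The sketch below is only an illustration, not part of the original answer: scrape_all_pages, fetch_page, parse_items, write_to_file and max_pages are hypothetical names, and it assumes the site returns a page without any itemListElement entries once you go past the last page, which I have not verified. It is written for Python 3 and uses only the standard library, so the fetching and parsing are passed in as functions.

```python
import json


def scrape_all_pages(fetch_page, parse_items, first_page=1, max_pages=50):
    """Collect items page by page until a page yields no listings.

    fetch_page(page_number) -> HTML string   (hypothetical callback)
    parse_items(html) -> list of dicts       (hypothetical callback)
    max_pages is an arbitrary safety cap, not a value taken from the site.
    """
    items = []
    page = first_page
    while page < first_page + max_pages:
        found = parse_items(fetch_page(page))
        if not found:
            # Assumption: an empty result means we are past the last page.
            break
        items.extend(found)
        page += 1
    return items


def write_to_file(items, path="nepremicnine1.json"):
    # Write the whole list once, after every page has been collected.
    with open(path, "w") as handle:
        json.dump(items, handle)
```

With the code from the question, fetch_page would wrap urlopen(url % page).read() and parse_items would wrap the findAll loop that builds each item dictionary. Calling write_to_file once at the end also avoids rewriting the JSON file for every single listing.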