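I am new to programming and I am having trouble scraping all pages with Python BeautifulSoup. I figured out how to scrape the first page, but I am lost on how to do all of the pages.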
Here is the code:
#!/usr/bin/python
# -*- encoding: utf-8 -*-
from urllib2 import urlopen
import json
from BeautifulSoup import BeautifulSoup

defaultPage = 1
items = []
url = "https://www.nepremicnine.net/oglasi-prodaja/ljubljana-mesto/stanovanje/%d/"

def getWebsiteContent(page=defaultPage):
    return urlopen(url % (page)).read()

def writeToFile(content):
    file = open("nepremicnine1.json", "w+")
    json.dump(content, file)
    # file.write(content)
    file.close()

def main():
    content = getWebsiteContent(page=defaultPage)
    soup = BeautifulSoup(content)
    posesti = soup.findAll("div", {"itemprop": "itemListElement"})
    for stanovanja in posesti:
        item = {}
        item["Naslov"] = stanovanja.find("span", attrs={"class": "title"}).string
        item["Velikost"] = stanovanja.find("span", attrs={"class": "velikost"}).string
        item["Cena"] = stanovanja.find("span", attrs={"class": "cena"}).string
        item["Slika"] = stanovanja.find("img", src = True)["src"]
        items.append(item)
    writeToFile(items)

main()
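So I would like to loop so that the %d in the url increases by 1 each time, since the pages are numbered 2, 3, and so on.
Any help is much appreciated.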
Answer 0 (score: 1)
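You are not incrementing the defaultPage variable.
The way you are trying to do it is correct. You just need to increment the defaultPage variable each time you finish scraping a page:

def main():
    global defaultPage  # defaultPage is module-level; without this, += raises UnboundLocalError
    numPages = 10  # placeholder value: you need to set this to the real number of pages
    while defaultPage <= numPages:  # Loop through all pages.
        content = getWebsiteContent(page=defaultPage)
        soup = BeautifulSoup(content)
        posesti = soup.findAll("div", {"itemprop": "itemListElement"})
        for stanovanja in posesti:
            item = {}
            item["Naslov"] = stanovanja.find("span", attrs={"class": "title"}).string
            item["Velikost"] = stanovanja.find("span", attrs={"class": "velikost"}).string
            item["Cena"] = stanovanja.find("span", attrs={"class": "cena"}).string
            item["Slika"] = stanovanja.find("img", src = True)["src"]
            items.append(item)
        defaultPage += 1  # move on to the next page
    writeToFile(items)  # write the accumulated items once, after the last page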
I think this should work.
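If the number of pages is not known in advance, one option is to keep requesting pages until one comes back with no listings. Below is a minimal sketch of that idea, not a tested scraper for this site: it assumes that a page past the last one yields no "itemListElement" entries, which I have not verified. The names scrape_all_pages, fetch_page, parse_items and max_pages are made up for illustration; fetch_page would wrap something like getWebsiteContent, and parse_items would wrap the BeautifulSoup part of main().

```python
import json


def scrape_all_pages(fetch_page, parse_items, first_page=1, max_pages=1000):
    """Collect items page by page until a page yields no items.

    fetch_page(page) -> HTML string and parse_items(html) -> list of items
    are hypothetical callables supplied by the caller.
    """
    all_items = []
    page = first_page
    while page < first_page + max_pages:  # safety cap so the loop always ends
        page_items = parse_items(fetch_page(page))
        if not page_items:  # assumption: an empty page means we ran past the end
            break
        all_items.extend(page_items)
        page += 1
    return all_items


def write_to_file(items, path="nepremicnine1.json"):
    # Write once after the loop instead of rewriting the file for every page.
    with open(path, "w") as f:
        json.dump(items, f)


if __name__ == "__main__":
    # Offline stand-ins for the real fetch and parse steps, for illustration only.
    fake_site = {1: "a,b", 2: "c", 3: ""}
    fetch = lambda page: fake_site.get(page, "")
    parse = lambda html: [{"Naslov": t} for t in html.split(",") if t]
    print(scrape_all_pages(fetch, parse))
```

Passing the fetch and parse steps in as functions keeps the loop itself independent of urllib2 and BeautifulSoup, so it can be tried out without hitting the site. The max_pages cap is just a guard in case the site never returns an empty page.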