Python Web Scraping an Unknown Number of Pages

Time: 2017-12-07 03:12:24

Tags: python-3.x web-scraping beautifulsoup

I have working code that scrapes a single Craigslist page for specific information, but what do I need to add to get data from all of the pages (without knowing in advance how many pages there are)?

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url="https://portland.craigslist.org/search/sss?query=electronics&sort=date"
uClient=uReq(my_url) #sends GET request to URL 
page_html=uClient.read() #reads returned data and puts it in a variable
uClient.close() #close the connection

#create a file that we will want later to write parsed data to
filename="ScrapedData.csv"
f=open(filename, 'w')
headers="date, location, title, price\n"
f.write(headers)

#use BS to parse the webpage
page_soup=soup(page_html,'html.parser') #applying BS to the obtained html
containers=page_soup.findAll('p',{'class':'result-info'})

for container in containers:
    container_date=container.findAll('time',{'class':'result-date'})
    date=container_date[0].text
    try:
        container_location=container.findAll('span',{'class':'result-hood'})
        location=container_location[0].text
    except IndexError:
        try:
            container_location=container.findAll('span',{'class':'nearby'})
            location=container_location[0].text
        except IndexError:
            location='NULL'
    container_title=container.findAll('a',{'class':'result-title'})
    title=container_title[0].text
    try:
        container_price=container.findAll('span',{'class':'result-price'})
        price=container_price[0].text
    except IndexError:
        price='NULL'

    #to print to screen
    print('date:'+date)
    print('location:'+location)
    print('title:'+title)
    print('price:'+price)
    #to write to csv
    f.write(date+','+location.replace(",","-")+','+title.replace(","," ")+','+price+'\n')

f.close()

2 Answers:

Answer 0 (score: 1)

You can try iterating over all of the pages by manipulating the "s" parameter in the URL, until you hit a page with no results (a page containing the text "search and you will find"):

import requests
from bs4 import BeautifulSoup as soup

#create the output file once, before paging, so it is not overwritten on every page
filename="ScrapedData.csv"
f=open(filename, 'w')
headers="date, location, title, price\n"
f.write(headers)

results_counter = 0
while True:
    my_url="https://portland.craigslist.org/search/sss?query=electronics&sort=date&s=%d" % results_counter
    page_html = requests.get(my_url).text
    if "search and you will find" in page_html:
        break
    else:
        results_counter += 120  #each results page holds 120 listings

        page_soup=soup(page_html,'html.parser') #applying BS to the obtained html
        containers=page_soup.findAll('p',{'class':'result-info'})
        ...
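For completeness, here is a minimal end-to-end sketch of that approach (an illustration, not the exact code from the answer), assuming the same CSV columns as the question (date, location, title, price) and using the "search and you will find" marker and the 120-listings page size mentioned above:

import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://portland.craigslist.org/search/sss?query=electronics&sort=date&s=%d"

with open("ScrapedData.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "location", "title", "price"])

    offset = 0
    while True:
        page_html = requests.get(BASE_URL % offset).text
        #Craigslist shows this phrase on a results page with no listings
        if "search and you will find" in page_html:
            break

        page_soup = BeautifulSoup(page_html, "html.parser")
        for container in page_soup.find_all("p", {"class": "result-info"}):
            date = container.find("time", {"class": "result-date"}).text
            location_tag = (container.find("span", {"class": "result-hood"})
                            or container.find("span", {"class": "nearby"}))
            location = location_tag.text if location_tag else "NULL"
            title = container.find("a", {"class": "result-title"}).text
            price_tag = container.find("span", {"class": "result-price"})
            price = price_tag.text if price_tag else "NULL"
            writer.writerow([date, location, title, price])

        offset += 120  #each Craigslist results page holds 120 listings

Opening the CSV once outside the loop avoids truncating it on every page, and csv.writer handles comma-escaping, so the manual replace() calls from the question are not needed.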

Answer 1 (score: 1)

Besides what Mr. Anderson has already shown, you could also do it like this for this site:

import requests
from bs4 import BeautifulSoup
import csv

page_link = "https://portland.craigslist.org/search/sss?s={}&query=electronics&sort=date"
for link in [page_link.format(page) for page in range(0,1147,120)]: #this is the fix
    res = requests.get(link)
    soup = BeautifulSoup(res.text,'lxml')
    for container in soup.select('.result-info'):
        try:
            date = container.select('.result-date')[0].text
        except IndexError:
            date = ""
        try:
            title = container.select('.result-title')[0].text
        except IndexError:
            title = ""
        try:
            price = container.select('.result-price')[0].text
        except IndexError:
            price = ""

        print(date,title,price)
        with open("craigs_item.csv","a",newline="",encoding="utf-8") as outfile:
            writer = csv.writer(outfile)
            writer.writerow([date,title,price])
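One caveat: the 1147 upper bound in range(0,1147,120) is hard-coded, so it has to be updated whenever the number of results changes. A small variation (my assumption, not part of the answer above, including the hypothetical first_text helper) is to keep requesting pages until one comes back with no .result-info containers, and to open the CSV file once instead of re-opening it for every row:

import csv
import requests
from bs4 import BeautifulSoup

page_link = "https://portland.craigslist.org/search/sss?s={}&query=electronics&sort=date"

def first_text(container, selector):
    #return the text of the first match for selector, or "" if nothing matches
    matches = container.select(selector)
    return matches[0].text if matches else ""

with open("craigs_item.csv", "a", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    offset = 0
    while True:
        res = requests.get(page_link.format(offset))
        containers = BeautifulSoup(res.text, 'lxml').select('.result-info')
        if not containers:  #no listings on this page, so we have gone past the last page
            break
        for container in containers:
            writer.writerow([first_text(container, '.result-date'),
                             first_text(container, '.result-title'),
                             first_text(container, '.result-price')])
        offset += 120  #each results page holds 120 listings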