Python Web Scraping with BeautifulSoup

Posted: 2017-12-07 23:42:23

Tags: python-3.x web-scraping beautifulsoup

I have a web-scraping program that fetches multiple pages, but right now I have to set the while loop's limit to a fixed number. I want a condition that stops the loop once it reaches the last page, or recognizes that there are no more items to scrape. Assume I don't know how many pages exist. How can I change the while-loop condition so that it stops without hard-coding an arbitrary number?


1 Answer:

Answer 0 (score: 1)

I use while True to run an infinite loop, and break to exit it when there is no more data:

    data = soup.select('.result-info')
    if not data:
        print('END: no data:')
        break

I use the csv module to save the data, so I don't need replace(","," "): csvwriter automatically wraps any field that contains a "," in quotes.
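A minimal sketch of that quoting behavior (with made-up sample values):

```python
import csv
import io

# csv.writer wraps any field containing the delimiter in quotes,
# so a comma inside the location text does not break the row.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["2017-12-07", "Portland, OR", "Xbox One", "$150"])

line = buf.getvalue().strip()
print(line)
# -> 2017-12-07,"Portland, OR",Xbox One,$150
```

Reading the file back with csv.reader restores the original four fields, comma included.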

s={} can go anywhere after the ?, so I put it at the end to keep the code more readable.

The portal serves the first page even when you use s=0, so I don't need to special-case i == 0 (BTW: in my code the variable has the more readable name offset).
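As a side note, a sketch of letting urlencode build the query string instead of formatting s={} by hand; the server does not care about parameter order, which is why s= can go anywhere in the query:

```python
from urllib.parse import urlencode

# Build the same search URL from a dict of parameters.
params = {"query": "xbox", "sort": "date", "s": 0}
url = "https://portland.craigslist.org/search/sss?" + urlencode(params)
print(url)
# -> https://portland.craigslist.org/search/sss?query=xbox&sort=date&s=0
```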

Full code:

import requests
from bs4 import BeautifulSoup
import csv

filename = "output.csv"

f = open(filename, 'w', newline="", encoding='utf-8')

csvwriter = csv.writer(f)

csvwriter.writerow( ["Date", "Location", "Title", "Price"] )

offset = 0

while True:
    print('offset:', offset)

    url = "https://portland.craigslist.org/search/sss?query=xbox&sort=date&s={}".format(offset)

    response = requests.get(url)
    if response.status_code != 200:
        print('END: request status:', response.status_code)
        break

    soup = BeautifulSoup(response.text, 'html.parser')

    data = soup.select('.result-info')
    if not data:
        print('END: no data:')
        break

    for container in data:
        date = container.select('.result-date')[0].text

        try:
            location = container.select('.result-hood')[0].text
        except IndexError:  # no .result-hood element on this result
            try:
                location = container.select('.nearby')[0].text
            except IndexError:
                location = 'NULL'
        #location = location.replace(","," ") # don't need it with `csvwriter`

        title = container.select('.result-title')[0].text

        try:
            price = container.select('.result-price')[0].text
        except IndexError:  # listing has no price
            price = "NULL"
        #title.replace(",", " ") # don't need it with `csvwriter`

        print(date, location, title, price)

        csvwriter.writerow( [date, location, title, price] )

    offset += 120

f.close()
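The stop condition can also be factored into a reusable generator, which keeps the "empty page means last page" logic in one place. Here fetch_page is a hypothetical stand-in for the requests + BeautifulSoup call above; it must return the list of result containers for a given offset, empty once we are past the last page:

```python
def paginate(fetch_page, step=120):
    """Yield items page by page until an empty page is returned."""
    offset = 0
    while True:
        items = fetch_page(offset)
        if not items:      # empty page -> stop, no page count needed
            return
        yield from items
        offset += step

# Usage with a fake two-page source standing in for the real fetch:
pages = {0: ["a", "b"], 120: ["c"]}
results = list(paginate(lambda off: pages.get(off, [])))
print(results)
# -> ['a', 'b', 'c']
```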