How do I move to a new page when web scraping with BeautifulSoup?

Asked: 2018-10-23 14:04:36

Tags: python pandas beautifulsoup

Below is my code that pulls records from craigslist. Everything works fine, but I need to be able to go to the next set of records and repeat the same process, and I'm new to programming. From looking at the page code, it seems I should keep following the arrow button contained in this span until it no longer contains an href:

<a href="/search/syp?s=120" class="button next" title="next page">next &gt; </a> 

I was thinking this might be a loop, but I suppose it could also be a try/except situation. Does that sound right? How would you implement it?

import requests
from urllib.request import urlopen
import pandas as pd

response = requests.get("https://nh.craigslist.org/d/computer-parts/search/syp")

soup = BeautifulSoup(response.text,"lxml")

listings = soup.find_all('li', class_= "result-row")

base_url = 'https://nh.craigslist.org/d/computer-parts/search/'

next_url = soup.find_all('a', class_= "button next")


dates = []
titles = []
prices = []
hoods = []

while base_url !=
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)

        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)

        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')

        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles' : titles, 'Price' : prices, 'Location' : hoods})

 #write to a file
listings_df.to_csv("craigslist_listings.csv")

2 Answers:

Answer 0 (score: 2)

For each page you scrape, you can find the next URL to crawl and add it to a list.

This is how I would do it, without changing your code too much. I've added some comments so you can follow what's going on, but leave me a comment if you need any further explanation:

import requests
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup


base_url = 'https://nh.craigslist.org/d/computer-parts/search/syp'
base_search_url = 'https://nh.craigslist.org'
urls = []
urls.append(base_url)
dates = []
titles = []
prices = []
hoods = []

while len(urls) > 0: # while we have urls to crawl
    print(urls)
    url = urls.pop(0) # removes the first element from the list of urls
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    next_url = soup.find('a', class_= "button next") # finds the next page link, if any
    if next_url: # find returns None when there is no next page link
        urls.append(base_search_url + next_url['href']) # adds next url to crawl to the list of urls to crawl

    listings = soup.find_all('li', class_= "result-row") # get all current url listings
    # this is your code unchanged
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)

        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)

        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')

        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles' : titles, 'Price' : prices, 'Location' : hoods})

 #write to a file
listings_df.to_csv("craigslist_listings.csv")

Edit: you also forgot to import BeautifulSoup in your code; I've added that in my answer.

Edit 2: you only need to look for the first instance of the next button, since a page can (and in this case does) have more than one next button.

Edit 3: for this to work, base_url should be changed to the one used in this code.

Answer 1 (score: 1)

This isn't a direct answer to how to access the "next" button, but it may be a solution to your problem. When I've done web scraping in the past, I've used the URL of each page to loop through search results.

On craigslist, the URL changes when you click "next page", and there is usually a pattern to that change which you can take advantage of. I didn't have to look long, but it appears the second page of results is https://nh.craigslist.org/search/syp?s=120 and the third page is https://nh.craigslist.org/search/syp?s=240, so the last part of the URL seems to increase by 120 each time.

You could create a list of multiples of 120, then build a for loop that appends each value to the end of the URL. Your current for loop would then be nested inside this one, as in the sketch below.
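Here is a minimal sketch of that URL-offset approach, assuming three pages of results; the offsets list is only a placeholder and would need to cover however many pages actually exist:

import requests
from bs4 import BeautifulSoup

base_url = 'https://nh.craigslist.org/search/syp'
offsets = [0, 120, 240]  # multiples of 120, one per results page (placeholder count)

titles = []
for offset in offsets:
    # the first page has no ?s= parameter; later pages use ?s=120, ?s=240, ...
    url = base_url if offset == 0 else base_url + '?s=' + str(offset)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    for listing in soup.find_all('li', class_= "result-row"):
        titles.append(listing.find('a', {'class': ["result-title"]}).text)

print(len(titles))

If you don't want to hard-code the list, you could instead keep requesting pages in steps of 120 until a request returns no result rows.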